GH-48924: [C++][CI] Fix pre-buffering issues in IPC file reader #48925
pitrou merged 1 commit into apache:main
@github-actions crossbow submit -g cpp
@github-actions crossbow submit -g cpp
Revision: 7749642
Submitted crossbow builds: ursacomputing/crossbow @ actions-cab35a473b
I'm not overly familiar with this part of Arrow, but generally things look good to me. Happy to offer an explicit approval if desired and there is no feedback from others.
After merging your PR, Conbench analyzed the 2 benchmarking runs that have been run so far on merge-commit 8010794. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 16 possible false positives for unstable benchmarks that are known to sometimes produce them.
### What changes are included in this PR?
Bug fixes and robustness improvements in the IPC file reader:
* Fix bug reading variadic buffers with pre-buffering enabled
* Fix bug reading dictionaries with pre-buffering enabled
* Validate IPC buffer offsets and lengths
Testing improvements:
* Exercise pre-buffering in IPC tests
* Actually exercise variadic buffers in IPC tests, by ensuring non-inline binary views are generated
* Run fuzz targets on golden IPC integration files in ASAN/UBSAN CI job
* Exercise pre-buffering in the IPC file fuzz target
Miscellaneous:
* Add convenience functions for integer overflow checking
### Are these changes tested?
Yes, by existing and improved tests.
### Are there any user-facing changes?
Bug fixes.
**This PR contains a "Critical Fix".** Fixes a potential crash reading variadic buffers with pre-buffering enabled.
* GitHub Issue: #48924
Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
* GH-48965: [Python][C++] Compare unique_ptr for CFlightResult or CFlightInfo to nullptr instead of NULL (#48968)
### Rationale for this change
Cython built code is currently failing to compile on free threaded wheels due to:
```
/arrow/python/build/temp.linux-x86_64-cpython-313t/_flight.cpp: In function ‘PyObject* __pyx_gb_7pyarrow_7_flight_12FlightClient_9do_action_2generator2(__pyx_CoroutineObject*, PyThreadState*, PyObject*)’:
/arrow/python/build/temp.linux-x86_64-cpython-313t/_flight.cpp:43068:110: error: call of overloaded ‘unique_ptr(NULL)’ is ambiguous
43068 | __pyx_t_3 = (__pyx_cur_scope->__pyx_v_result->result == ((std::unique_ptr< arrow::flight::Result> )NULL));
|
```
### What changes are included in this PR?
Update comparing `unique_ptr[CFlightResult]` and `unique_ptr[CFlightInfo]` from `NULL` to `nullptr`.
### Are these changes tested?
Yes via archery.
### Are there any user-facing changes?
No
* GitHub Issue: #48965
Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
* GH-48924: [C++][CI] Fix pre-buffering issues in IPC file reader (#48925)
### What changes are included in this PR?
Bug fixes and robustness improvements in the IPC file reader:
* Fix bug reading variadic buffers with pre-buffering enabled
* Fix bug reading dictionaries with pre-buffering enabled
* Validate IPC buffer offsets and lengths
Testing improvements:
* Exercise pre-buffering in IPC tests
* Actually exercise variadic buffers in IPC tests, by ensuring non-inline binary views are generated
* Run fuzz targets on golden IPC integration files in ASAN/UBSAN CI job
* Exercise pre-buffering in the IPC file fuzz target
Miscellaneous:
* Add convenience functions for integer overflow checking
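To illustrate what validating buffer offsets and lengths with overflow-checked arithmetic can look like, here is a minimal Python sketch. The names `add_with_overflow_check` and `validate_buffer` are hypothetical and are not the helpers added by this PR; the sketch only assumes that offsets and lengths are signed 64-bit values that must stay inside the file.

```python
INT64_MAX = 2**63 - 1  # largest value an int64 offset or length can hold


def add_with_overflow_check(a: int, b: int) -> int:
    """Add two non-negative int64 values, raising instead of wrapping around.

    Hypothetical helper for illustration only.
    """
    if a < 0 or b < 0:
        raise ValueError("negative offset or length")
    if a > INT64_MAX - b:
        raise OverflowError("offset + length overflows int64")
    return a + b


def validate_buffer(offset: int, length: int, file_size: int) -> None:
    """Reject a buffer whose [offset, offset + length) range leaves the file."""
    end = add_with_overflow_check(offset, length)
    if end > file_size:
        raise ValueError(
            f"buffer range [{offset}, {end}) exceeds file size {file_size}"
        )
```

Checking `a > INT64_MAX - b` before adding is the usual way to detect overflow without ever computing the out-of-range sum.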
### Are these changes tested?
Yes, by existing and improved tests.
### Are there any user-facing changes?
Bug fixes.
**This PR contains a "Critical Fix".** Fixes a potential crash reading variadic buffers with pre-buffering enabled.
* GitHub Issue: #48924
Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
* GH-48966: [C++] Fix cookie duplication in the Flight SQL ODBC driver and the Flight Client (#48967)
### Rationale for this change
The bug breaks a Flight SQL server that refreshes the auth token when cookie authentication is enabled.
### What changes are included in this PR?
1. In the ODBC layer, removed the code that adds a 2nd ClientCookieMiddlewareFactory in the client options (the 1st one is registered in `BuildFlightClientOptions`). This fixes the issue of the duplicate header cookie fields.
2. In the Flight client layer, use the case-insensitive equality comparator instead of the case-insensitive less-than comparator for the cookie cache, which is an unordered map. This fixes the issue of duplicate cookie keys.
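The second point can be sketched in Python. A hash map needs an equality predicate that agrees with its hash function. A less-than comparator returns false for two identical names, so when it is used as the equality predicate the map treats them as different keys and stores both. The `CaseInsensitiveKey` class and `store_cookie` function below are hypothetical illustrations, not the Flight client code.

```python
class CaseInsensitiveKey:
    """Cookie name that hashes and compares without regard to case."""

    def __init__(self, name: str):
        self.name = name

    def __eq__(self, other):
        # Equality (not ordering) is what a hash map uses to detect duplicates.
        return (
            isinstance(other, CaseInsensitiveKey)
            and self.name.lower() == other.name.lower()
        )

    def __hash__(self):
        # Must agree with __eq__: equal keys produce equal hashes.
        return hash(self.name.lower())


def store_cookie(cache: dict, name: str, value: str) -> None:
    """Insert or overwrite a cookie; names differing only in case collide."""
    cache[CaseInsensitiveKey(name)] = value
```

Storing `"Session"` and then `"session"` leaves one entry holding the newer value, which is the behaviour a refreshed auth token needs.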
### Are these changes tested?
Manually on Windows, and CI
### Are there any user-facing changes?
No
* GitHub Issue: #48966
Authored-by: jianfengmao <jianfengmao@deephaven.io>
Signed-off-by: David Li <li.davidm96@gmail.com>
* GH-48691: [C++][Parquet] Write serializer may crash if the value buffer is empty (#48692)
### Rationale for this change
WriteArrowSerialize could unconditionally read values from the Arrow array even for null rows. Since the caller could provide a zero-sized dummy buffer for all-null arrays, this caused an ASAN heap-buffer-overflow.
### What changes are included in this PR?
Check early that the array is not all null values before serializing it.
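The idea behind the early check can be sketched as follows. This is an illustration in Python rather than the C++ change itself: `serialize_values` is a hypothetical function, and the value buffer and validity bitmap are modelled as plain lists.

```python
def serialize_values(values: list, validity: list) -> list:
    """Return one output slot per row, reading `values` only for valid rows.

    `values` may be an empty dummy buffer when every row is null.
    """
    if not any(validity):
        # All rows are null: never index into the (possibly empty) buffer.
        return [None] * len(validity)
    return [values[i] if valid else None for i, valid in enumerate(validity)]
```

With the guard in place, `serialize_values([], [False, False])` returns two null slots instead of indexing past the end of an empty buffer.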
### Are these changes tested?
Added tests.
### Are there any user-facing changes?
No
* GitHub Issue: #48691
Authored-by: rexan <rexan@apache.org>
Signed-off-by: Gang Wu <ustcwg@gmail.com>
* GH-48947 [CI][Python] Install pymanager.msi instead of pymanager.msix to fix docker rebuild on Windows wheels (#48948)
### Rationale for this change
As soon as we have to rebuild our Windows docker images they will fail installing python-manager-25.0.msix
### What changes are included in this PR?
- Use `pymanager.msi` to install python version instead of `pymanager.msix` which has problems on Docker.
- Update `pymanager install` command to use newer API (old command fails with missing flags)
- Update the default python command to use the suffix required for free-threaded builds when building free-threaded wheels
### Are these changes tested?
Yes via archery
### Are there any user-facing changes?
No
* GitHub Issue: #48947
Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
* GH-48990: [Ruby] Add support for writing date arrays (#48991)
### Rationale for this change
There are date32 and date64 variants for date arrays.
### What changes are included in this PR?
* Add `ArrowFormat::DateType#to_flatbuffers`
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* GitHub Issue: #48990
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-48992: [Ruby] Add support for writing large UTF-8 array (#48993)
### Rationale for this change
It's a large variant of UTF-8 array.
### What changes are included in this PR?
* Add `ArrowFormat::LargeUTF8Type#to_flatbuffers`
* Add support for large UTF-8 array of `#values` and `#raw_records`
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* GitHub Issue: #48992
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-48949: [C++][Parquet] Add Result versions for parquet::arrow::FileReader::ReadRowGroup(s) (#48982)
### Rationale for this change
`FileReader::ReadRowGroup(s)` previously returned `Status` and required callers to pass an `out` parameter.
### What changes are included in this PR?
Introduce `Result<std::shared_ptr<Table>>` returning APIs to allow clearer error propagation:
- Add new Result-returning `ReadRowGroup()` / `ReadRowGroups()` methods
- Deprecate the old Status/out-parameter overloads
- Update C++ callers and R/Python/GLib bindings to use the new API
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
Status versions of FileReader::ReadRowGroup(s) have been deprecated.
```cpp
virtual ::arrow::Status ReadRowGroup(int i, const std::vector<int>& column_indices,
                                     std::shared_ptr<::arrow::Table>* out);
virtual ::arrow::Status ReadRowGroup(int i, std::shared_ptr<::arrow::Table>* out);
virtual ::arrow::Status ReadRowGroups(const std::vector<int>& row_groups,
                                      const std::vector<int>& column_indices,
                                      std::shared_ptr<::arrow::Table>* out);
virtual ::arrow::Status ReadRowGroups(const std::vector<int>& row_groups,
                                      std::shared_ptr<::arrow::Table>* out);
```
* GitHub Issue: #48949
Lead-authored-by: fenfeng9 <fenfeng9@qq.com>
Co-authored-by: fenfeng9 <36840213+fenfeng9@users.noreply.github.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-48985: [GLib][Ruby] Fix GC problems in node options and expressions (#48989)
### Rationale for this change
Some node options and expressions do not hold references to their arguments. Without those references, the arguments may be freed by the GC.
### What changes are included in this PR?
* Refer arguments of `garrow_filter_node_options_new()`
* Refer arguments of `garrow_project_node_options_new()`
* Refer arguments of `garrow_aggregate_node_options_new()`
* Refer arguments of `garrow_literal_expression_new()`
* Refer arguments of `garrow_call_expression_new()`
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* GitHub Issue: #48985
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-47692: [CI][Python] Do not fallback to return 404 if wheel is found on emscripten jobs (#49007)
### Rationale for this change
When looking for the wheel the script was falling back to returning a 404 even when the wheel was found:
```
+ python scripts/run_emscripten_tests.py dist/pyarrow-24.0.0.dev31-cp312-cp312-pyodide_2024_0_wasm32.whl --dist-dir=/pyodide --runtime=chrome
127.0.0.1 - - [27/Jan/2026 01:14:50] code 404, message File not found
```
This caused the job to time out and fail.
### What changes are included in this PR?
Correct the logic so that a 404 is returned only if the requested file wasn't found.
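A minimal sketch of the corrected control flow, assuming the bug was a fall-through to the 404 branch after a successful lookup. `resolve_request` and its `files` mapping are hypothetical and do not reproduce the handler in `run_emscripten_tests.py`.

```python
def resolve_request(path: str, files: dict) -> tuple:
    """Return (status, body) for a requested path.

    `files` maps request paths to their contents (bytes).
    """
    body = files.get(path)
    if body is not None:
        # Found: return immediately so we never fall through to the 404.
        return 200, body
    return 404, b"File not found"
```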
### Are these changes tested?
Yes via archery
### Are there any user-facing changes?
No
* GitHub Issue: #47692
Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
* GH-48912: [R] Configure C++20 in conda R on continuous benchmarking (#48974)
### Rationale for this change
Benchmark failing since C++20 upgrade due to lack of C++20 configuration
### What changes are included in this PR?
Changes entirely from :robot: (Claude) with discussion from me regarding optimal approach.
Description as follows:
> conda-forge's R package doesn't have CXX20 configured in Makeconf, even though the compiler (gcc 14.3.0) supports C++20. This causes Arrow R package installation to fail with "a C++20 compiler is required" because `R CMD config CXX20` returns empty.
>
> This PR adds CXX20 configuration to R's Makeconf before building the Arrow R package in the benchmark hooks, if not already present.
### Are these changes tested?
I got :robot: to try it locally in a container but I'm not convinced we'll know for sure til we try it out properly.
> Tested in Docker container with Amazon Linux 2023 + conda-forge R - confirmed `R CMD config CXX20` returns empty before patch and `g++` after patch.
>
> The only thing we didn't test end-to-end was actually building Arrow R, but that would have taken much longer and the configure check (R CMD config CXX20 returning non-empty) is exactly what Arrow's configure script tests before proceeding.
### Are there any user-facing changes?
Nope
* GitHub Issue: #48912
Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
* GH-36889: [C++][Python] Fix duplicate CSV header when first batch is empty (#48718)
### Rationale for this change
Fixes https://github.com/apache/arrow/issues/36889
When writing CSV from a table where the first batch is empty, the header gets written twice:
```python
table = pa.table({"col1": ["a", "b", "c"]})
combined = pa.concat_tables([table.schema.empty_table(), table])
write_csv(combined, buf)
# Result: "col1"\n"col1"\n"a"\n"b"\n"c"\n <-- header appears twice
```
### What changes are included in this PR?
The bug happens because:
1. Header is written to `data_buffer_` and flushed during `CSVWriterImpl` initialization
2. The buffer is not cleared after flush
3. When the next batch is empty, `TranslateMinimalBatch` returns early without modifying `data_buffer_`
4. The write loop then writes `data_buffer_` which still contains stale content
The fix introduces a `WriteAndClearBuffer()` helper that writes the buffer to sink and clears it. This helper is used in all write paths:
- `WriteHeader()`
- `WriteRecordBatch()`
- `WriteTable()`
This ensures the buffer is always clean after any flush, making it impossible for stale content to be written again.
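The write-then-clear pattern can be modelled in a few lines of Python. `MiniCsvWriter` is a hypothetical stand-in for `CSVWriterImpl`, with a list of strings playing the role of `data_buffer_`. It is a sketch of the idea, not a port of the C++ code.

```python
import io


class MiniCsvWriter:
    """Toy CSV writer that always empties its buffer after flushing."""

    def __init__(self, sink, header):
        self.sink = sink
        self.buffer = []
        self.buffer.append(",".join(f'"{name}"' for name in header) + "\n")
        self._write_and_clear_buffer()

    def _write_and_clear_buffer(self):
        # Flush to the sink, then clear so stale content cannot be rewritten.
        self.sink.write("".join(self.buffer))
        self.buffer.clear()

    def write_batch(self, rows):
        # An empty batch appends nothing, so the flush below writes nothing.
        for row in rows:
            self.buffer.append(",".join(f'"{value}"' for value in row) + "\n")
        self._write_and_clear_buffer()
```

Writing an empty batch followed by three rows to an `io.StringIO` sink yields a single header line followed by the three data lines.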
### Are these changes tested?
Yes. Added C++ tests in `writer_test.cc` and Python tests in `test_csv.py`:
- Empty batch at start of table
- Empty batch in middle of table
### Are there any user-facing changes?
No API changes. This is a bug fix that prevents duplicate headers when writing CSV from tables with empty batches.
* GitHub Issue: #36889
Lead-authored-by: Ruiyang Wang <ruiyang@anthropic.com>
Co-authored-by: Ruiyang Wang <56065503+rynewang@users.noreply.github.com>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Gang Wu <ustcwg@gmail.com>
* GH-48932: [C++][Packaging][FlightRPC] Fix `rsync` build error ODBC Nightly Package (#48933)
### Rationale for this change
#48932
### What changes are included in this PR?
- Fix `rsync` build error ODBC Nightly Package
### Are these changes tested?
- tested in CI
### Are there any user-facing changes?
- After fix, users should be able to get Nightly ODBC package release
* GitHub Issue: #48932
Authored-by: Alina (Xi) Li <alina.li@improving.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-48951: [Docs] Add documentation relating to AI tooling (#48952)
### Rationale for this change
Add guidance re AI tooling
### What changes are included in this PR?
Updates to main docs and links to it from new contributor's guide
### Are these changes tested?
No, but I'll build the docs
### Are there any user-facing changes?
Just docs
:robot: Changes generated using Claude Code - I took the discussion from the mailing list, asked it to add the original text and then apply suggested changes one at a time, made a few of my own tweaks, and then instructed it to edit things down a bit for clarity and conciseness.
* GitHub Issue: #48951
Lead-authored-by: Nic Crane <thisisnic@gmail.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
* GH-49029: [Doc] Run sphinx-build in parallel (#49026)
### Rationale for this change
`sphinx-build` allows for parallel operation, but it builds serially by default and that can be very slow on our docs given the number of documents (many of them auto-generated from API docs).
### Are these changes tested?
By existing CI jobs.
### Are there any user-facing changes?
No.
* GitHub Issue: #49029
Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
* GH-33450: [C++] Remove GlobalForkSafeMutex (#49033)
### Rationale for this change
This functionality is unused now that we have a proper atfork facility.
### Are these changes tested?
By existing CI tests.
### Are there any user-facing changes?
Removing an API that was always meant for internal use (though we didn't flag it explicitly as internal).
* GitHub Issue: #33450
Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
* GH-35437: [C++] Remove obsolete TODO about DictionaryArray const& return types (#48956)
### Rationale for this change
The TODO comment in `vector_array_sort.cc` asking whether `DictionaryArray::dictionary()` and `DictionaryArray::indices()` should return `const&` is obsolete.
It was added in commit 6ceb12f700a when dictionary array sorting was implemented. At that time, these methods returned `std::shared_ptr<Array>` by value, causing unnecessary copies.
The issue was fixed in commit 95a8bfb319b which changed both methods to return `const std::shared_ptr<Array>&`, removing the copies. However, the TODO comment was never removed.
### What changes are included in this PR?
Removed the outdated TODO comment that referenced GH-35437.
### Are these changes tested?
I did not test.
### Are there any user-facing changes?
No.
* GitHub Issue: #35437
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
* GH-48586: [Python][CI] Upload artifact to python-sdist job (#49008)
### Rationale for this change
When running the python-sdist job we are currently not uploading the build artifact to the job.
### What changes are included in this PR?
Upload artifact as part of building the job so it's easier to test and validate contents if necessary.
### Are these changes tested?
Yes via archery.
### Are there any user-facing changes?
No
* GitHub Issue: #48586
Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
* MINOR: [R] Add 22.0.0.1 to compatibility matrix (#49039)
### Rationale for this change
CI needs updating to test old R package versions
### What changes are included in this PR?
Add 22.0.0.1
### Are these changes tested?
Nah, it's CI stuff
### Are there any user-facing changes?
No
Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
* GH-48961: [Docs][Python] Doctest fails on pandas 3.0 (#48969)
### Rationale for this change
See issue #48961
Pandas 3.0.0 string storage type changes https://github.com/pandas-dev/pandas/pull/62118/changes
and https://pandas.pydata.org/docs/whatsnew/v3.0.0.html#dedicated-string-data-type-by-default
### What changes are included in this PR?
Updating several doctest examples from `string` to `large_string`.
### Are these changes tested?
Yes, locally.
### Are there any user-facing changes?
No.
Closes #48961
* GitHub Issue: #48961
Authored-by: Tadeja Kadunc <tadeja.kadunc@gmail.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>
* GH-49037: [Benchmarking] Install R from non-conda source for benchmarking (#49038)
### Rationale for this change
Slow benchmarks due to conda duckdb building from source
### What changes are included in this PR?
Try ditching conda and installing R via rig and using PPM binaries
### Are these changes tested?
I'll try running
### Are there any user-facing changes?
Nope
* GitHub Issue: #49037
Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
* GH-49042: [C++] Remove mimalloc patch (#49041)
### Rationale for this change
This patch was integrated upstream in https://github.com/microsoft/mimalloc/pull/1139
### Are these changes tested?
By existing CI.
### Are there any user-facing changes?
No.
* GitHub Issue: #49042
Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49024: [CI] Update Debian version in `.env` (#49032)
### Rationale for this change
The default Debian version in `.env` now maps to oldstable; we should use stable instead.
Also prune entries that are not used anymore.
### Are these changes tested?
By existing CI jobs.
### Are there any user-facing changes?
No.
* GitHub Issue: #49024
Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49027: [Ruby] Add support for writing time arrays (#49028)
### Rationale for this change
There are 32/64 bit and second/millisecond/microsecond/nanosecond variants for time arrays.
### What changes are included in this PR?
* Add `ArrowFormat::TimeType#to_flatbuffers`
* Add bit width information to `ArrowFormat::TimeType`
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* GitHub Issue: #49027
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49030: [Ruby] Add support for writing fixed size binary array (#49031)
### Rationale for this change
It's a fixed size variant of binary array.
### What changes are included in this PR?
* Add `ArrowFormat::FixedSizeBinaryType#to_flatbuffers`
* Add `ArrowFormat::FixedSizeBinaryArray#each_buffer`
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* GitHub Issue: #49030
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-48866: [C++][Gandiva] Truncate subseconds beyond milliseconds in `castTIMESTAMP_utf8` and `castTIME_utf8` (#48867)
### Rationale for this change
Fixes #48866. The Gandiva precompiled time functions `castTIMESTAMP_utf8` and `castTIME_utf8` currently reject timestamp and time string literals with more than 3 subsecond digits (beyond millisecond precision), throwing an "Invalid millis" error. This behavior is inconsistent with other implementations.
### What changes are included in this PR?
- Fixed `castTIMESTAMP_utf8` and `castTIME_utf8` functions to truncate subseconds beyond 3 digits instead of throwing an error
- Updated tests. Replaced error-expecting tests with truncation verification tests and added edge cases
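The truncation rule can be sketched in Python. `subseconds_to_millis` is a hypothetical helper that takes only the digits after the decimal point. It is not the Gandiva precompiled function, just an illustration of keeping the first three digits and discarding the rest.

```python
def subseconds_to_millis(fraction: str) -> int:
    """Convert the digits after the decimal point to whole milliseconds.

    Digits beyond the third are truncated (not rounded); shorter inputs
    are right-padded with zeros, so "5" means 500 ms.
    """
    if not (fraction.isascii() and fraction.isdigit()):
        raise ValueError(f"invalid subsecond digits: {fraction!r}")
    return int(fraction[:3].ljust(3, "0"))
```

Under this rule `"123456"` becomes 123 ms and `"1239"` becomes 123 ms instead of triggering an error.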
### Are these changes tested?
Yes
### Are there any user-facing changes?
No
* GitHub Issue: #48866
Authored-by: Arkadii Kravchuk <arkadii.kravchuk@dremio.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-48673: [C++] Fix ToStringWithoutContextLines to check for :\d+ pattern before removing lines (#48674)
### Rationale for this change
This PR proposes to fix the TODO at https://github.com/apache/arrow/blob/7ebc88c8fae62ed97bc30865c845c8061132af7e/cpp/src/arrow/status.cc#L131-L134, which would allow better parsing of line numbers.
I could not find the relevant example to demonstrate within this project but assume that we have a test such as:
(Generated by ChatGPT)
```cpp
TEST(BlockParser, ErrorMessageWithColonsPreserved) {
  Status st(StatusCode::Invalid,
            "CSV parse error: Row #2: Expected 2 columns, got 3: 12:34:56,key:value,data\n"
            "Error details: Time format: 12:34:56, Key: value\n"
            "parser_test.cc:940 Parse(parser, csv, &out_size)");
  std::string expected_msg =
      "Invalid: CSV parse error: Row #2: Expected 2 columns, got 3: 12:34:56,key:value,data\n"
      "Error details: Time format: 12:34:56, Key: value";
  ASSERT_RAISES_WITH_MESSAGE(Invalid, expected_msg, st);
}

// Test with URL-like data (another common case with colons)
TEST(BlockParser, ErrorMessageWithURLPreserved) {
  Status st(StatusCode::Invalid,
            "CSV parse error: Row #2: Expected 1 columns, got 2: http://arrow.apache.org:8080/api,data\n"
            "URL: http://arrow.apache.org:8080/api\n"
            "parser_test.cc:974 Parse(parser, csv, &out_size)");
  std::string expected_msg =
      "Invalid: CSV parse error: Row #2: Expected 1 columns, got 2: http://arrow.apache.org:8080/api,data\n"
      "URL: http://arrow.apache.org:8080/api";
  ASSERT_RAISES_WITH_MESSAGE(Invalid, expected_msg, st);
}
```
then it fails.
### What changes are included in this PR?
Fixed `Status::ToStringWithoutContextLines()` to only remove context lines matching the `filename:line` pattern (`:\d+`), preventing legitimate error messages containing colons from being incorrectly stripped.
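The stripping rule can be illustrated with a short Python sketch. The exact pattern used by the C++ fix is not shown in this description, so the regular expression below is an assumption: it treats a trailing line as a context line only when it starts with `file:line` followed by whitespace. The function name is likewise hypothetical.

```python
import re

# Assumed shape of a context line: "<file>:<line> <expression>".
_CONTEXT_LINE = re.compile(r"^\S+:\d+\s")


def to_string_without_context_lines(message: str) -> str:
    """Drop trailing lines that look like `file:line expr` context lines."""
    lines = message.split("\n")
    # Always keep the first line; stop at the first non-context line.
    while len(lines) > 1 and _CONTEXT_LINE.match(lines[-1]):
        lines.pop()
    return "\n".join(lines)
```

Lines such as `Error details: Time format: 12:34:56, Key: value` contain colons and digits but do not start with `file:line` followed by whitespace, so they are preserved.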
### Are these changes tested?
Manually tested, and unittests were added, with `cmake .. --preset ninja-debug -DARROW_EXTRA_ERROR_CONTEXT=ON`.
### Are there any user-facing changes?
No, test-only.
* GitHub Issue: #48673
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49044: [CI][Python] Fix test_download_tzdata_on_windows by adding required user-agent on urllib request (#49052)
### Rationale for this change
See: #49044
### What changes are included in this PR?
urllib requests are now sent with `"user-agent": "pyarrow"`.
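A minimal sketch of attaching that header with the standard library. `build_tzdata_request` is a hypothetical helper name and the URL in the usage note is a placeholder, not the address the test downloads from.

```python
from urllib.request import Request


def build_tzdata_request(url: str) -> Request:
    """Build a request that carries an explicit user-agent header."""
    # urllib stores header names capitalized, i.e. as "User-agent".
    return Request(url, headers={"user-agent": "pyarrow"})
```

For example, `build_tzdata_request("https://example.invalid/tzdata")` can be passed to `urllib.request.urlopen`. The sketch only builds the request and does not open a connection.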
### Are these changes tested?
It's a CI fix.
### Are there any user-facing changes?
No, just a CI test fix.
* GitHub Issue: #49044
Authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
* GH-48983: [Packaging][Python] Build wheel from sdist using build and add check to validate LICENSE.txt and NOTICE.txt are part of the wheel contents (#48988)
### Rationale for this change
Currently the files are missing from the published wheels.
### What changes are included in this PR?
- Ensure the license and notice files are part of the wheels
- Use build frontend to build wheels
- Build wheel from sdist
### Are these changes tested?
Yes, via archery.
I've validated all wheels will fail with the new check if LICENSE.txt or NOTICE.txt are missing:
```
AssertionError: LICENSE.txt is missing from the wheel.
```
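A check of this kind can be sketched with the standard `zipfile` module, since a wheel is a zip archive. `check_wheel_licenses` is a hypothetical function, and matching on the file's base name anywhere in the archive is an assumption. The actual check in the packaging scripts may look in a specific directory.

```python
import io
import zipfile

REQUIRED_FILES = ("LICENSE.txt", "NOTICE.txt")


def check_wheel_licenses(wheel_file) -> None:
    """Raise AssertionError if a required license file is not in the wheel.

    `wheel_file` is a path or a binary file-like object.
    """
    with zipfile.ZipFile(wheel_file) as wheel:
        # Compare base names so the files may live in any subdirectory.
        basenames = {name.rsplit("/", 1)[-1] for name in wheel.namelist()}
    for required in REQUIRED_FILES:
        if required not in basenames:
            raise AssertionError(f"{required} is missing from the wheel.")
```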
### Are there any user-facing changes?
No
* GitHub Issue: #48983
Lead-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
* GH-49059: [C++] Fix issues found by OSS-Fuzz in IPC reader (#49060)
### Rationale for this change
Fix two issues found by OSS-Fuzz in the IPC reader:
* a controlled abort on invalid IPC metadata: https://oss-fuzz.com/testcase-detail/5301064831401984
* a nullptr dereference on invalid IPC metadata: https://oss-fuzz.com/testcase-detail/5091511766417408
Neither of these two issues is a security issue.
### Are these changes tested?
Yes, by new unit tests and new fuzz regression files.
### Are there any user-facing changes?
No.
**This PR contains a "Critical Fix".** (If the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld), please provide explanation. If not, you can remove this.)
* GitHub Issue: #49059
Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
* GH-49055: [Ruby] Add support for writing decimal128/256 arrays (#49056)
### Rationale for this change
Only decimal128/256 arrays are supported.
### What changes are included in this PR?
Add `ArrowFormat::DecimalType#to_flatbuffers`.
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* GitHub Issue: #49055
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49053: [Ruby] Add support for writing timestamp array (#49054)
### Rationale for this change
It has `unit` and `time_zone` parameters.
### What changes are included in this PR?
* Add `ArrowFormat::TimestampType#to_flatbuffers`
* Set time zone when GLib timestamp type is converted from C++ timestamp type
* Use `time_zone` not `timezone`
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* GitHub Issue: #49053
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-28859: [Doc][Python] Use only code-block directive and set up doctest for the python user guide (#48619)
### Rationale for this change
In many places in the Python User Guide the code examples are written with the IPython directive (elsewhere code-block is used). IPython directives are converted to IPython format (`In` and `Out`) during the doc build. This can lead to slower builds.
### What changes are included in this PR?
IPython directives are converted to runnable code-block (with `>>>` and `...`) and pytest doctest support for `.rst` files is added to the `conda-python-docs` CI job. This means the code in the Python User Guide is tested separately to the building of the documentation.
### Are these changes tested?
Yes, with the CI.
### Are there any user-facing changes?
Changes to the Python User Guide examples will have to be tested with `pytest --doctest-glob='*.rst' docs/source/python/file.rst`
* GitHub Issue: #28859
Lead-authored-by: AlenkaF <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: tadeja <tadeja@users.noreply.github.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>
* GH-49065: [C++] Remove unnecessary copies of shared_ptr in Type::BOOL and Type::NA at GrouperImpl (#49066)
### Rationale for this change
The grouper code was creating a `shared_ptr<DataType>` for every key type, even when it wasn't needed. This resulted in unnecessary reference counting operations. For example, `BooleanKeyEncoder` and `NullKeyEncoder` don't require a `shared_ptr` in their constructors, yet we were creating one for every key of those types.
### What changes are included in this PR?
Changed `GrouperImpl::Make()` to use `TypeHolder` references directly and only call `GetSharedPtr()` when needed by encoder constructors. This eliminates `shared_ptr` creation for `Type::BOOL` and `Type::NA` cases. Other encoder types (dictionary, fixed-width, binary) still require `shared_ptr` since their constructors take `shared_ptr<DataType>` parameters for ownership.
### Are these changes tested?
Yes, existing tests.
### Are there any user-facing changes?
No.
* GitHub Issue: #49065
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-48159 [C++][Gandiva] Projector make is significantly slower after move to OrcJIT (#49063)
### Rationale for this change
Reduces LLVM TargetMachine object creation from 3 to 1. This object is expensive to create and the extra copies weren't needed.
### What changes are included in this PR?
Refactor the Engine class to only create one target machine and pass that to the necessary functions.
Before the change (3 TargetMachines created):
First TargetMachine: In Engine::Make(), MakeTargetMachineBuilder() is called, then BuildJIT() is called. Inside LLJITBuilder::create(), when prepareForConstruction() runs, if no DataLayout was set, it calls JTMB->getDefaultDataLayoutForTarget() which creates a temporary TargetMachine just to get the DataLayout.
Second TargetMachine: Inside BuildJIT(), when setCompileFunctionCreator is used with the lambda, that lambda calls JTMB.createTargetMachine() to create a TargetMachine for the TMOwningSimpleCompiler.
Third TargetMachine: Back in Engine::Make(), after BuildJIT() returns, there's an explicit call to jtmb.createTargetMachine() to create target_machine_ for the Engine.
After the change (1 TargetMachine created):
The key changes are:
Create TargetMachine first: The code now creates the TargetMachine explicitly at the start of Engine::Make. That machine is passed to BuildJIT. In BuildJIT, that machine's DataLayout is passed to LLJITBuilder, which prevents prepareForConstruction() from calling getDefaultDataLayoutForTarget() (which would create a temporary TargetMachine).
Use SimpleCompiler instead of TMOwningSimpleCompiler:
SimpleCompiler takes a reference to an existing TargetMachine rather than owning one, so no new TargetMachine is created.
A shared_ptr is used to ensure that TargetMachine stays around for the lifetime of the LLJIT instance.
### Are these changes tested?
Yes, unit and integration.
### Are there any user-facing changes?
No.
* GitHub Issue: #48159
Lead-authored-by: logan.riggs@gmail.com <logan.riggs@gmail.com>
Co-authored-by: Logan Riggs <logan.riggs@dremio.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49043: [C++][FS][Azure] Avoid bugs caused by empty first page(s) followed by non-empty subsequent page(s) (#49049)
### Rationale for this change
Prevent bugs similar to https://github.com/apache/arrow/issues/49043
### What changes are included in this PR?
- Implement `SkipStartingEmptyPages` for various types of PagedResponses used in the `AzureFileSystem`.
- Apply `SkipStartingEmptyPages` on the response from every list operation that returns a paged response.
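A minimal Python sketch of the idea, not the C++ implementation. It assumes a paged listing can be modelled as an iterable of lists, and the name `skip_starting_empty_pages` is hypothetical. Leading empty pages are dropped so that a caller inspecting the first page does not mistake an empty first page for an empty result:

```python
from typing import Iterable, Iterator, List


def skip_starting_empty_pages(pages: Iterable[List[str]]) -> Iterator[List[str]]:
    """Yield pages, dropping any empty pages that precede the first non-empty one."""
    seen_non_empty = False
    for page in pages:
        if not seen_non_empty and not page:
            continue  # leading empty page: skip it
        seen_non_empty = True
        yield page


# Hypothetical listing: two empty pages, then data, then an empty page in the middle.
pages = [[], [], ["a/b.parquet"], [], ["a/c.parquet"]]
first_page = next(skip_starting_empty_pages(pages), [])
```

Only the empty pages at the start are skipped; later empty pages are passed through unchanged.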
### Are these changes tested?
Ran the tests in the codebase including the ones that need to connect to real blob storage. This makes me fairly confident that I haven't introduced a regression.
The only reproduction I've found involves reading a production Azure blob storage account. With this I've tested that this PR solves https://github.com/apache/arrow/issues/49043, but I haven't been able to reproduce it in any checked-in tests. I tried copying a chunk of data around our prod reproduction into Azurite, but still can't reproduce it.
### Are there any user-facing changes?
Some low probability bugs will be gone. No interface changes.
* GitHub Issue: #49043
Authored-by: Thomas Newton <thomas.w.newton@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49034 [C++][Gandiva] Fix binary_string to not trigger error for null strings (#49035)
### Rationale for this change
The binary_string function will attempt to allocate 0 bytes of memory, which results in a null ptr being returned and the function interprets that as an error.
### What changes are included in this PR?
Add kCanReturnErrors to the function definition to match other string functions.
Move the check for 0 byte length input earlier in the binary_string function to prevent the 0 allocation.
Add a unit test.
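The ordering issue can be sketched in Python as follows. This is an illustration only: `arena_alloc` is a hypothetical allocator that returns `None` for a zero-byte request, and the byte copy stands in for the real conversion logic, which is omitted.

```python
from typing import Optional


def arena_alloc(size: int) -> Optional[bytearray]:
    """Hypothetical allocator: returns None when asked for 0 bytes."""
    return bytearray(size) if size > 0 else None


def binary_string(data: bytes) -> bytes:
    # Handle empty input before allocating, so that a None from the
    # allocator can only mean a genuine allocation failure.
    if len(data) == 0:
        return b""
    buf = arena_alloc(len(data))
    if buf is None:
        raise MemoryError("could not allocate output buffer")
    buf[:] = data  # placeholder for the real conversion
    return bytes(buf)
```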
### Are these changes tested?
Yes, unit and integration testing.
### Are there any user-facing changes?
No.
* GitHub Issue: #49034
Authored-by: Logan Riggs <logan.riggs@dremio.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-48980: [C++] Use COMPILE_OPTIONS instead of deprecated COMPILE_FLAGS (#48981)
### Rationale for this change
Arrow requires CMake 3.25 but was still using the deprecated `COMPILE_FLAGS` property. The recommended replacement is `COMPILE_OPTIONS` (introduced in CMake 3.11).
### What changes are included in this PR?
Replaced `COMPILE_FLAGS` with `COMPILE_OPTIONS` across `CMakeLists.txt` files, converted space-separated strings to semicolon-separated lists, and removed obsolete TODO comments.
### Are these changes tested?
Yes, through CI build and existing tests.
### Are there any user-facing changes?
No.
* GitHub Issue: #48980
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49069: [C++] Share Trie instances across CSV value decoders (#49070)
### Rationale for this change
The CSV converter was building identical Trie data structures (for null/true/false values) in every decoder instance, causing duplicate memory allocation and initialization overhead.
### What changes are included in this PR?
- Introduced `TrieCache` struct to hold shared Trie instances (null_trie, true_trie, false_trie)
- Updated `ValueDecoder` and all decoder subclasses to accept and reference a shared `TrieCache` instead of building their own Tries
- Updated `Converter` base class to create one `TrieCache` per converter and pass it to all decoders
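A rough Python sketch of the sharing pattern described above. The class names mirror the PR description, but the bodies are assumptions: frozensets stand in for the real Trie structures, and `BooleanDecoder` is a hypothetical decoder.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class TrieCache:
    """Stand-in for the shared lookup structures; frozensets replace real tries."""
    null_trie: FrozenSet[str]
    true_trie: FrozenSet[str]
    false_trie: FrozenSet[str]

    @classmethod
    def make(cls, nulls: Iterable[str], trues: Iterable[str],
             falses: Iterable[str]) -> "TrieCache":
        return cls(frozenset(nulls), frozenset(trues), frozenset(falses))


class BooleanDecoder:
    def __init__(self, cache: TrieCache) -> None:
        self.cache = cache  # reference to the shared cache, not a copy

    def decode(self, text: str) -> Optional[bool]:
        if text in self.cache.null_trie:
            return None
        if text in self.cache.true_trie:
            return True
        if text in self.cache.false_trie:
            return False
        raise ValueError(f"not a boolean: {text!r}")


# One cache per converter, shared by every decoder it creates.
cache = TrieCache.make(["", "NULL"], ["true", "1"], ["false", "0"])
decoders = [BooleanDecoder(cache) for _ in range(3)]
```

Building the cache once and handing out references means the lookup structures are initialized a single time, however many decoders are created.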
### Are these changes tested?
Yes, all existing tests. I ran a simple benchmark showing roughly 2-4% faster converter creation, and obviously less memory usage.
### Are there any user-facing changes?
No.
* GitHub Issue: #49069
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49076: [CI] Update vcpkg baseline to newer version (#49062)
### Rationale for this change
The current version of vcpkg used is from April 2025.
### What changes are included in this PR?
Update baseline to newer version.
### Are these changes tested?
Yes on CI. I've validated for example that xsimd 14 will be pulled.
### Are there any user-facing changes?
No
* GitHub Issue: #49076
Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49074: [Ruby] Add support for writing interval arrays (#49075)
### Rationale for this change
There are year month/day time/month day nano variants.
### What changes are included in this PR?
* Add `ArrowFormat::IntervalType#to_flatbuffers`
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* GitHub Issue: #49074
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49071: [Ruby] Add support for writing list and large list arrays (#49072)
### Rationale for this change
They use different offset size.
### What changes are included in this PR?
* Add `ArrowFormat::ListType#to_flatbuffers`
* Add `ArrowFormat::LargeListType#to_flatbuffers`
* Add `ArrowFormat::VariableSizeListArray#child`
* Add `ArrowFormat::VariableSizeListArray#each_buffer`
* `garrow_array_get_null_bitmap()` returns `NULL` when null bitmap doesn't exist
* Add `garrow_list_array_get_value_offsets_buffer()`
* Add `garrow_large_list_array_get_value_offsets_buffer()`
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* GitHub Issue: #49071
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49087 [CI][Packaging][Gandiva] Add support for LLVM 15 or earlier again (#49091)
### Rationale for this change
LLVM 15 or earlier uses `llvm::Optional` not `std::optional`.
### What changes are included in this PR?
Use `llvm::Optional` with LLVM 15 or earlier.
### Are these changes tested?
Yes, compiling.
### Are there any user-facing changes?
No
* GitHub Issue: #49087
Authored-by: logan.riggs@gmail.com <logan.riggs@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49100: [Docs] Broken link to Swift page in implementations.rst (#49101)
### Rationale for this change
The Swift documentation link in the implementations.rst file was broken and returned a 404 error.
### What changes are included in this PR?
Updated the Swift documentation link in https://github.com/apache/arrow/blob/235841d644d5454f7067c44f580f301446ba1cc0/docs/source/implementations.rst?plain=1#L124 from the [broken GitHub README link](https://github.com/apache/arrow-swift/blob/main/Arrow/README.md) to the [Swift Package documentation](https://swiftpackageindex.com/apache/arrow-swift/main/documentation/arrow)
### Are these changes tested?
Yes.
### Are there any user-facing changes?
No.
* GitHub Issue: #49100
Lead-authored-by: ChiLin Chiu <chilin.chiou@gmail.com>
Co-authored-by: Chilin <chilin.cs07@nycu.edu.tw>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49096: [Ruby] Add support for writing struct array (#49097)
### Rationale for this change
It's a nested array.
### What changes are included in this PR?
* Add `ArrowFormat::StructType#to_flatbuffers`
* Add `ArrowFormat::StructArray#each_buffer`
* Add `ArrowFormat::StructArray#children`
* Fix `ArrowFormat::Array#n_nulls`
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* GitHub Issue: #49096
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49093: [Ruby] Add support for writing duration array (#49094)
### Rationale for this change
It has unit parameter.
### What changes are included in this PR?
* Add `ArrowFormat::DurationType#to_flatbuffers`
* Add duration support to `#values` and `raw_records`
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* GitHub Issue: #49093
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49098: [Packaging][deb] Add missing libarrow-cuda-glib-doc (#49099)
### Rationale for this change
Documents for libarrow-cuda-glib are generated but they aren't packaged.
### What changes are included in this PR?
Package documents for libarrow-cuda-glib.
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* GitHub Issue: #49098
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-48764: [C++] Update xsimd (#48765)
### Rationale for this change
Homogenized versions used
### What changes are included in this PR?
Move to xsimd 14 to benefit from the latest improvements relevant to the integer unpacking routines.
### Are these changes tested?
Yes, with current CI.
In fact, due to the absence of a pin, part of the CI already runs xsimd 14.
### Are there any user-facing changes?
No.
* GitHub Issue: #48764
Authored-by: AntoinePrv <AntoinePrv@users.noreply.github.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
* GH-46008: [Python][Benchmarking] Remove unused asv benchmarking files (#49047)
### Rationale for this change
As discussed on the issue, we don't seem to have run asv benchmarks on Python in recent years. It is probably broken.
### What changes are included in this PR?
Remove asv benchmarking related files and docs.
### Are these changes tested?
No. CI and the preview-docs run validate the docs.
### Are there any user-facing changes?
No
* GitHub Issue: #46008
Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
* GH-49108: [Python] SparseCOOTensor.__repr__ missing f-string prefix (#49109)
### Rationale for this change
`SparseCOOTensor.__repr__` outputs literal `{self.type}` and `{self.shape}` instead of actual values due to missing f-string prefix.
### What changes are included in this PR?
Add f prefix to the string in `SparseCOOTensor.__repr__`.
### Are these changes tested?
Yes, it works after adding the f-string prefix:
```python3
>>> import pyarrow as pa
>>> import numpy as np
>>> dense_tensor = np.array([[0, 1, 0], [2, 0, 3]], dtype=np.float32)
>>> sparse_coo = pa.SparseCOOTensor.from_dense_numpy(dense_tensor)
>>> sparse_coo
<pyarrow.SparseCOOTensor>
type: float
shape: (2, 3)
```
### Are there any user-facing changes?
Yes, this fixes a bug that caused incorrect output to be produced. Before the fix:
```python3
>>> import pyarrow as pa
>>> import numpy as np
>>> dense_tensor = np.array([[0, 1, 0], [2, 0, 3]], dtype=np.float32)
>>> sparse_coo = pa.SparseCOOTensor.from_dense_numpy(dense_tensor)
>>> sparse_coo
<pyarrow.SparseCOOTensor>
type: {self.type}
shape: {self.shape}
```
* GitHub Issue: #49108
Authored-by: Chilin <chilin.cs07@nycu.edu.tw>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
* GH-49083: [CI][Python] Remove dask-contrib/dask-expr from the nightly dask test builds (#49126)
### Rationale for this change
Failing nightly job for dask (test-conda-python-3.11-dask-upstream_devel).
### What changes are included in this PR?
Removal of dask-contrib/dask-expr package as it is included in the dask dataframe module since January 2025.
### Are these changes tested?
Yes, with the extended dask build.
### Are there any user-facing changes?
No.
* GitHub Issue: #49083
Authored-by: AlenkaF <frim.alenka@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
* GH-49117: [Ruby] Add support for writing union arrays (#49118)
### Rationale for this change
There are dense and sparse variants.
### What changes are included in this PR?
* Add `garrow_union_array_get_n_fields()`
* Add `ArrowFormat::UnionArray#children`
* Add `ArrowFormat::DenseUnionArray#each_buffer`
* Add `ArrowFormat::SparseUnionArray#each_buffer`
* Add `ArrowFormat::UnionType#to_flatbuffers`
* Add `Arrow::UnionArray#fields`
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* GitHub Issue: #49117
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49119: [Ruby] Add support for writing map array (#49120)
### Rationale for this change
It's a list based array.
### What changes are included in this PR?
* Add `ArrowFormat::MapType#to_flatbuffers`
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* GitHub Issue: #49119
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-48922: [C++] Support Status-returning callables in Result::Map (#49127)
### Rationale for this change
Currently, Result::Map fails to compile when the mapping function returns a Status because it tries to instantiate Result, which is prohibited. This change allows Map to return Status directly in such cases.
### What changes are included in this PR?
- Added EnsureResult specialization to allow Map to return Status directly.
- Added unit tests to verify success/error propagation and return type resolution.
### Are these changes tested?
Yes.
### Are there any user-facing changes?
No
* GitHub Issue: #48922
Authored-by: Abhishek Bansal <abhibansal593@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
* GH-49003: [C++] Don't consider `out_of_range` an error in float parsing (#49095)
### Rationale for this change
This PR restores the behavior previous to version 23 for floating-point parsing on overflow and subnormal.
`fast_float` didn't assign an error code on overflow in version `3.10.1` and assigned `±Inf` on overflow and `0.0` on subnormal. With the update to version `8.1`, it started to assign `std::errc::result_out_of_range` in such cases.
### What changes are included in this PR?
Ignore `std::errc::result_out_of_range` and produce `±Inf` / `0.0` as appropriate instead of failing the conversion.
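For comparison, Python's built-in `float()` already has the semantics being restored here: out-of-range magnitudes are not errors. The sketch below only illustrates the target behaviour; the wrapper name `parse_float` is hypothetical and is not the C++ parser.

```python
import math


def parse_float(text: str) -> float:
    """Parse a decimal string; out-of-range magnitudes are not errors."""
    # float() rounds overflow to +/-inf and underflow to 0.0 instead of
    # raising, which is the behaviour this change restores in the C++ parser.
    return float(text)


values = [parse_float(s) for s in ("10E-617", "10E617", "-10E617")]
```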
### Are these changes tested?
Yes. Created tests for overflow with positive and negative signed mantissa, and also created tests for subnormal, all of them for binary{16,32,64}.
### Are there any user-facing changes?
Yes. The CSV reader in `libarrow==23` inferred such values as strings, whereas earlier versions parsed them as `0` or `±inf`.
With this patch, the CSV reader in PyArrow outputs:
```python
>>> import pyarrow
>>> import pyarrow.csv
>>> import io
>>> table = pyarrow.csv.read_csv(io.BytesIO(f"data\n10E-617\n10E617\n-10E617".encode()))
>>> print(table)
pyarrow.Table
data: double
----
data: [[0,inf,-inf]]
```
Closes #49003
* GitHub Issue: #49003
Authored-by: Alvaro-Kothe <kothe65@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
* GH-48941: [C++] Generate proper UTF-8 strings in JSON test utilities (#48943)
### Rationale for this change
The JSON test utility `GenerateAscii` was only generating ASCII characters. The tests should also cover proper UTF-8 and Unicode handling.
### What changes are included in this PR?
Replaced ASCII-only generation with proper UTF-8 string generation that produces valid Unicode scalar values across all planes (BMP, SMP, SIP, planes 3-16), correctly encoded per RFC 3629.
Added that function as a utility.
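A short Python sketch of this kind of generator, under stated assumptions: the helper names `random_scalar` and `encode_utf8` are hypothetical, and the real C++ utility may choose code points differently. It picks Unicode scalar values (every code point except the surrogates) and encodes them with the byte layouts from RFC 3629.

```python
import random


def random_scalar(rng: random.Random) -> int:
    """Pick a Unicode scalar value: any code point except the surrogates."""
    while True:
        cp = rng.randint(0, 0x10FFFF)
        if not 0xD800 <= cp <= 0xDFFF:
            return cp


def encode_utf8(cp: int) -> bytes:
    """Encode one scalar value using the RFC 3629 byte layouts."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])


rng = random.Random(42)
sample = b"".join(encode_utf8(random_scalar(rng)) for _ in range(16))
decoded = sample.decode("utf-8")  # raises UnicodeDecodeError if the bytes are invalid
```

Decoding the generated bytes with a strict decoder is a cheap check that the output is valid UTF-8.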
### Are these changes tested?
The existing JSON tests cover this.
### Are there any user-facing changes?
No, test-only.
* GitHub Issue: #48941
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
* GH-49067: [R] Disable GCS on macos (#49068)
### Rationale for this change
Builds that complete on CRAN
### What changes are included in this PR?
Disable GCS by default
### Are these changes tested?
### Are there any user-facing changes?
Hopefully not
* GitHub Issue: #49067
---------
Co-authored-by: Nic Crane <thisisnic@gmail.com>
* GH-49115: [CI][Packaging][Python] Update vcpkg baseline for our wheels (#49116)
### Rationale for this change
Current wheels are failing to be built due to old version of vcpkg failing with our latest main.
### What changes are included in this PR?
- Update vcpkg version.
- Update patches
- Add `perl-Time-Piece` to some images as required to build newer OpenSSL.
### Are these changes tested?
Yes on CI
### Are there any user-facing changes?
No
* GitHub Issue: #49115
Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-48954: [C++] Add test for null-type dictionary sorting and clarify XXX comment (#48955)
### Rationale for this change
Null-type dictionaries (e.g., `dictionary(int8(), null())`) are valid Arrow constructs supported from day one, but the sorting code had an uncertain `XXX Should this support Type::NA?` comment. We should explicitly support and test this because other functions already support this:
```python
import pyarrow as pa
import pyarrow.compute as pc
pc.array_sort_indices(pa.array([None, None, None, None], type=pa.int32()))
# [0, 1, 2, 3]
pc.array_sort_indices(pa.DictionaryArray.from_arrays(
indices=pa.array([None, None, None, None], type=pa.int8()),
dictionary=pa.array([], type=pa.null())
))
# [0, 1, 2, 3]
```
I believe it does not make sense to specifically disallow this in dictionaries at this point.
### What changes are included in this PR?
Added a unittest for null sorting behaviour.
### Are these changes tested?
Yes, the unittest was added.
### Are there any user-facing changes?
No.
* GitHub Issue: #48954
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
* GH-36193: [R] arm64 binaries for R (#48574)
### Rationale for this change
Issues building on ARM
### What changes are included in this PR?
CI job and nixlibs update
### Are these changes tested?
On CI
### Are there any user-facing changes?
No
AI changes :robot:: Claude decided where to make the changes and helped debug failing builds, but I updated most of it (e.g. rstudio -> posit, choice of runners etc)
* GitHub Issue: #36193
Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
* GH-48397: [R] Update docs on how to get our libarrow builds (#48995)
### Rationale for this change
Turning off GCS on CRAN to prevent excessive build times, need to tell people who wanna work with GCS how to do that.
### What changes are included in this PR?
Update docs.
### Are these changes tested?
Will preview docs build.
### Are there any user-facing changes?
Just docs.
* GitHub Issue: #48397
Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
* GH-49104: [C++] Fix Segfault in SparseCSFIndex::Equals with mismatched dimensions (#49105)
### Rationale for This Change
The `SparseCSFIndex::Equals` method can crash when comparing two sparse indices that have a different number of dimensions. The method iterates over the `indices()` and `indptr()` vectors of the current object and accesses the corresponding elements in the `other` object without first verifying that both objects have matching vector sizes. This can lead to out-of-bounds access and a segmentation fault when the dimension counts differ.
### What Changes Are Included in This PR?
This change adds explicit size equality checks for the `indices()` and `indptr()` vectors at the beginning of the `SparseCSFIndex::Equals` method. If the dimensions do not match, the method now safely returns `false` instead of attempting invalid memory access.
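The guard can be sketched in Python as below. This is an illustration under assumptions: the function `csf_index_equals` and its plain-list arguments are hypothetical stand-ins for the C++ method and its tensor vectors.

```python
from typing import List, Sequence


def csf_index_equals(
    indices_a: Sequence[List[int]], indptr_a: Sequence[List[int]],
    indices_b: Sequence[List[int]], indptr_b: Sequence[List[int]],
) -> bool:
    # Compare sizes first: differing dimension counts mean "not equal",
    # and the loops below may then index both sides safely.
    if len(indices_a) != len(indices_b) or len(indptr_a) != len(indptr_b):
        return False
    for i in range(len(indices_a)):
        if indices_a[i] != indices_b[i]:
            return False
    for i in range(len(indptr_a)):
        if indptr_a[i] != indptr_b[i]:
            return False
    return True
```

Without the size check, the indexed loops would read past the end of the shorter side, which in Python raises `IndexError` and in C++ is an out-of-bounds access.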
### Are These Changes Tested?
Yes. The fix has been validated through targeted reproduction of the crash scenario using mismatched dimension counts, ensuring the method behaves safely and deterministically.
### Are There Any User-Facing Changes?
No. This change improves internal safety and robustness without altering public APIs or observable user behavior.
* GitHub Issue: #49104
Lead-authored-by: Alirana2829 <alimahmoodrana00@gmail.com>
Co-authored-by: Ali Mahmood Rana <159713825+AliRana30@users.noreply.github.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Rok Mihevc <rok@mihevc.org>
* MINOR: [Docs] Add links to AI-generated code guidance (#49131)
### Rationale for this change
Add link to AI-generated code guidance - we should make sure the docs are updated before we merge this though
### What changes are included in this PR?
Add link to AI-generated code guidance
### Are these changes tested?
No
### Are there any user-facing changes?
No
Lead-authored-by: Nic Crane <thisisnic@gmail.com>
Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
* MINOR: [R] Add new vignette to pkgdown config (#49145)
### Rationale for this change
CI failing on preview-docs; see #49141
### What changes are included in this PR?
Add the vignette created in #49068 to pkgdown config
### Are these changes tested?
I'll trigger CI
### Are there any user-facing changes?
Nah
Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
* GH-49150: [Doc][CI][Python] Doctests failing on rst files due to pandas 3+ (#49088)
Fixes: #49150
See https://github.com/apache/arrow/pull/48619#issuecomment-3823269381
### Rationale for this change
Fix CI failures
### What changes are included in this PR?
Tests are made more general to allow for Pandas 2 and Pandas 3 style string types
### Are these changes tested?
By CI
### Are there any user-facing changes?
No
* GitHub Issue: #49150
Authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Rok Mihevc <rok@mihevc.org>
* GH-41990: [C++] Fix AzureFileSystem compilation on Windows (#48971)
Let me preface this pull request that I have not worked in C++ in quite a while. Apologies if this is missing modern idioms or is an obtuse fix.
### Rationale for this change
I encountered an issue trying to compile the AzureFileSystem backend in C++ on Windows. Searching the issue tracker, it appears this is already a [known](https://github.com/apache/arrow/issues/41990) but unresolved problem. This is an attempt to either address the issue or move the conversation forward for someone more experienced.
### What changes are included in this PR?
AzureFileSystem uses `unique_ptr` while the other cloud file system implementations rely on `shared_ptr`. Since this is a forward-declared Impl in the headers file but the destructor was defined inline (via `= default`), we're getting compilation issues with MSVC due to it requiring the complete type earlier than GCC/Clang.
This change removes the defaulted definition from the header file and moves it into the .cc file where we have a complete type.
Unrelated, I've also wrapped 2 exception variables in `ARROW_UNUSED`. These are warnings treated as errors by MSVC at compile time. This was revealed in CI after resolving the issue above.
### Are these changes tested?
I've enabled building and running the test suite in GHA in 8dd62d62a9af022813e9c8662956740340a9473f. I believe a large portion of those tests may be skipped though since Azurite isn't present from what I can see. I'm not tied to the GHA updates being included in the PR, it's currently here for demonstration purposes. I noticed the other FS implementations are also not built and tested on Windows.
One quirk of this PR is getting WIL in place to compile the Azure C++ SDK was not intuitive for me. I've placed a dummy `wilConfig.cmake` to get the Azure SDK to build, but I'd assume there's a better way to do this. I'm happy to refine the build setup if we choose to keep it.
### Are there any user-facing changes?
Nothing here should affect user-facing code beyond fixing the compilation issues. If there are concerns for things I'm missing, I'm happy to discuss those.
* GitHub Issue: #41990
Lead-authored-by: Nate Prewitt <nateprewitt@microsoft.com>
Co-authored-by: Nate Prewitt <nate.prewitt@gmail.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49138: [Packaging][Python] Remove nightly cython install from manylinux wheel dockerfile (#49139)
### Rationale for this change
We use nightly versions of Cython for free-threaded PyArrow wheels and they are currently failing, see https://github.com/apache/arrow/issues/49138
### What changes are included in this PR?
Nightly Cython install is removed and Cython is installed via [requirements file](https://github.com/apache/arrow/blob/main/python/requirements-wheel-build.txt#L2).
### Are these changes tested?
Yes.
### Are there any user-facing changes?
No.
* GitHub Issue: #49138
Authored-by: AlenkaF <frim.alenka@gmail.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>
* GH-33459: [C++][Python] Support step >= 1 in list_slice kernel (#48769)
### Rationale for this change
Closes ARROW-18281, which has been open since 2022. The `list_slice` kernel currently rejects `start == stop`, but should return empty lists instead (following Python slicing semantics).
The implementation already handles this case correctly. When ARROW-18282 added step support, `bit_util::CeilDiv(stop - start, step)` naturally returns 0 for `start == stop`, producing empty lists. The only issue was the validation check (`start >= stop`) that prevented this from working.
### What changes are included in this PR?
- Changed validation from `start >= stop` to `start > stop`
- Updated error message
- Added test cases
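A small Python sketch of why relaxing the check is enough. The function names and error messages are hypothetical, and the sketch ignores the actual length of each list; it only computes how many slots a slice requests.

```python
def ceil_div(value: int, divisor: int) -> int:
    """Integer ceiling division for a positive divisor."""
    return (value + divisor - 1) // divisor


def sliced_length(start: int, stop: int, step: int = 1) -> int:
    # Relaxed check: only start > stop is rejected, so start == stop is allowed.
    if start < 0 or start > stop:
        raise ValueError(
            f"`start`({start}) should be >= 0 and not greater than `stop`({stop})")
    if step < 1:
        raise ValueError(f"`step`({step}) should be >= 1")
    # For start == stop this is ceil_div(0, step) == 0, i.e. an empty slice.
    return ceil_div(stop - start, step)
```

With `start == stop` the ceiling division yields 0 for any positive step, which matches Python's own slicing, where `[1, 2, 3][0:0]` is an empty list.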
### Are these changes tested?
Yes, tests were added.
### Are there any user-facing changes?
Yes.
```python
import pyarrow.compute as pc
pc.list_slice([[1,2,3]], 0, 0)
```
Before:
```
pyarrow.lib.ArrowInvalid: `start`(0) should be greater than 0 and smaller than `stop`(0)
```
After:
```
<pyarrow.lib.ListArray object at 0x1a01b8b20>
[
[]
]
```
* GitHub Issue: #33459
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>
* GH-41863: [Python][Parquet] Support lz4_raw as a compression name alias (#49135)
Closes https://github.com/apache/arrow/issues/41863
### Rationale for this change
Other tools in the parquet ecosystem distinguish between `LZ4` and `LZ4_RAW`, matching the specification: https://parquet.apache.org/docs/file-format/data-pages/compression/
`LZ4` (framing) is of course deprecated. PyArrow does not support it, and instead simplifies the user-facing API, using `LZ4` as an alias for the `LZ4_RAW` codec.
However, PyArrow does not accept `LZ4_RAW` as a valid alias for the `LZ4_RAW` codec:
```
ArrowException: Unsupported compression: lz4_raw
```
This is a friction issue, and confusing for some users who are aware of the differences.
### What changes are included in this PR?
- Adding `LZ4_RAW` to the acceptable codec names list.
- Modifying the `LZ4->LZ4_RAW` mapping to also accept `LZ4_RAW->LZ4_RAW`.
- Adding a test
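The alias handling might look roughly like this in Python. This is a sketch, not PyArrow's actual code: the table `_CODEC_ALIASES`, the set `_ACCEPTED` (an illustrative, incomplete list of names) and the function `resolve_codec` are all hypothetical.

```python
# Hypothetical alias table: user-facing name -> codec actually used.
_CODEC_ALIASES = {
    "LZ4": "LZ4_RAW",      # existing alias
    "LZ4_RAW": "LZ4_RAW",  # newly accepted spelling
}
# Illustrative set of accepted names, not the full list.
_ACCEPTED = {"NONE", "SNAPPY", "GZIP", "BROTLI", "ZSTD", "LZ4", "LZ4_RAW"}


def resolve_codec(name: str) -> str:
    key = name.upper()
    if key not in _ACCEPTED:
        raise ValueError(f"Unsupported compression: {name}")
    return _CODEC_ALIASES.get(key, key)
```

Both spellings resolve to the same codec, so existing callers that pass `lz4` see no change.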
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes, an additive change to the accepted codec names.
* GitHub Issue: #41863
Authored-by: Nick Woolmer <29717167+nwoolmer@users.noreply.github.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>
* GH-48868: [Doc] Document security model for the Arrow formats (#48870)
### Rationale for this change
Accessing Arrow data or any of the formats can have non-trivial security implications, this is an attempt at documenting those.
### What changes are included in this PR?
Add a Security Considerations page in the Format section.
**Doc preview:** https://s3.amazonaws.com/arrow-data/pr_docs/48870/format/Security.html
### Are these changes tested?
N/A
### Are there any user-facing changes?
No.
* GitHub Issue: #48868
Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
* GH-49004: [C++][FlightRPC] Run ODBC tests in workflow using `cpp_test.sh` (#49005)
### Rationale for this change
#49004
### What changes are included in this PR?
- Run tests using `cpp_test.sh` in the ODBC job of C++ Extra CI.
Note: `find_package(Arrow)` check in `cpp_test.sh` is disabled due to blocker GH-49050
### Are these changes tested?
Yes, in CI
### Are there any user-facing changes?
N/A
* GitHub Issue: #49004
Lead-authored-by: Alina (Xi) Li <alina.li@improving.com>
Co-authored-by: Alina (Xi) Li <96995091+alinaliBQ@users.noreply.github.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49092: [C++][FlightRPC][CI] Nightly Packaging: Add `dev-yyyy-mm-dd` to ODBC MSI name (#49151)
### Rationale for this change
#49092
### What changes are included in this PR?
- Add `dev-yyyy-mm-dd` to ODBC MSI name. This is a similar approach to R nightly.
Before: `Apache Arrow Flight SQL ODBC-1.0.0-win64.msi`. After: `Apache Arrow Flight SQL ODBC-1.0.0-dev-2026-02-04-win64.msi`.
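The renaming rule can be sketched in Python as follows. The helper `nightly_msi_name` is hypothetical; the actual CI step may be implemented differently.

```python
from datetime import date


def nightly_msi_name(name: str, build_date: date) -> str:
    """Insert `dev-yyyy-mm-dd` before the `-win64.msi` suffix."""
    suffix = "-win64.msi"
    if not name.endswith(suffix):
        raise ValueError(f"unexpected MSI name: {name}")
    stem = name[: -len(suffix)]
    return f"{stem}-dev-{build_date.isoformat()}{suffix}"


renamed = nightly_msi_name("Apache Arrow Flight SQL ODBC-1.0.0-win64.msi", date(2026, 2, 4))
```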
### Are these changes tested?
Tested in CI. Successfully renamed file: https://github.com/apache/arrow/actions/runs/21686252848/job/62534629714?pr=49151#step:3:26
### Are there any user-facing changes?
Yes, the nightly ODBC file names will be changed as described above.
* GitHub Issue: #49092
Authored-by: Alina (Xi) Li <alina.li@improving.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49156: [Python] Require GIL for string comparison (#49161)
### Rationale for this change
With Cython 3.3.0.a0 this failed. After some discussion it seems that this should have always had to require the GIL.
### What changes are included in this PR?
Moving statement out of the `with nogil` context manager.
### Are these changes tested?
Existing CI builds pyarrow.
### Are there any user-facing changes?
No
* GitHub Issue: #49156
Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
* GH-48575: [C++][FlightRPC] Standalone ODBC macOS CI (#48577)
### Rationale for this change
#48575
### What changes are included in this PR?
- Add new ODBC workflow for macOS Intel 15 and 14 arm64.
- Added ODBC build fixes to enable build on macOS CI.
### Are these changes tested?
Tested in CI and local macOS Intel and M1 environments.
### Are there any user-facing changes?
N/A
* GitHub Issue: #48575
Lead-authored-by: Alina (Xi) Li <alina.li@improving.com>
Co-authored-by: justing-bq <62349012+justing-bq@users.noreply.github.com>
Co-authored-by: Victor Tsang <victor.tsang@improving.com>
Co-authored-by: Alina (Xi) Li <alinal@bitquilltech.com>
Co-authored-by: vic-tsang <victor.tsang@improving.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49164: [C++] Avoid invalid if() args in cmake when arrow is a subproject (#49165)
### Rationale for this change
Ref #49164: In subproject builds, `DefineOptions.cmake` sets `ARROW_DEFINE_OPTIONS_DEFAULT` to OFF, so `ARROW_SIMD_LEVEL` is never defined. The `if()` at `cpp/src/arrow/io/CMakeLists.txt:48` uses `${ARROW_SIMD_LEVEL}` and expands to empty, leading to invalid `if()` arguments.
### What changes are included in this PR?
Use the variable name directly (no `${}`).
### Are these changes tested?
Yes.
### Are there any user-facing changes?
None.
* GitHub Issue: #49164
Authored-by: Rossi Sun <zanmato1984@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-48132: [Ruby] Add support for writing dictionary array (#49175)
### Rationale for this change
Delta dictionary message support is out of scope.
### What changes are included in this PR?
* Add `ArrowFormat::DictionaryArray#each_buffer`
* Add `ArrowFormat::DictionaryType#build_fb_type`
* Add support for dictionary message in `ArrowFormat::StreamingWriter`
* Add support for writing dictionary message blocks in footer in `ArrowFormat::FileWriter`.
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* GitHub Issue: #48132
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49081: [C++][Parquet] Correct variant's extension name (#49082)
### Rationale for this change
Correct variant extension according to arrow's specification.
### What changes are included in this PR?
Modified variant's hardcoded extension name.
### Are these changes tested?
Yes.
### Are there any user-facing changes?
No.
* GitHub Issue: #49081
Authored-by: Zehua Zou <zehuazou2000@gmail.com>
Signed-off-by: Gang Wu <ustcwg@gmail.com>
* GH-49102: [CI] Add type checking infrastructure and CI workflow for type annotations (#48618)
### Rationale for this change
This is the first in a series of PRs adding type annotations to pyarrow and resolving #32609.
### What changes are included in this PR?
This PR establishes infrastructure for type checking:
- Adds CI workflow for running mypy, pyright, and ty type checkers on Linux, macOS, and Windows
- Configures type checkers to validate stub files (excluding source files for now)
- Adds PEP 561 `py.typed` marker to enable type checking
- Updates wheel build scripts to include stub files in distributions
- Creates initial minimal stub directory structure
- Updates developer documentation with type checking workflow
### Are these changes tested?
No. This is mostly a CI change.
### Are there any user-facing changes?
This does not add any actual annotations (only the `py.typed` marker), so users should not be affected.
* GitHub Issue: #32609
* GitHub Issue: #49102
Lead-authored-by: Rok Mihevc <rok@mihevc.org>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Rok Mihevc <rok@mihevc.org>
* GH-49190: [C++][CI] Fix `unknown job 'odbc' error` in C++ Extra Workflow (#49192)
### Rationale for this change
See #49190
### What changes are included in this PR?
Fix `unknown job 'odbc' error` caused by typo
### Are these changes tested?
Tested in CI
### Are there any user-facing changes?
N/A
* GitHub Issue: #49190
Authored-by: Alina (Xi) Li <alinal@bitquilltech.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* MINOR: [CI] Bump docker/login-action from 3.6.0 to 3.7.0 (#49191)
Bumps [docker/login-action](https://github.com/docker/login-action) from 3.6.0 to 3.7.0.
Release notes (sourced from [docker/login-action's releases](https://github.com/docker/login-action/releases)), v3.7.0:
- Add `scope` input to set scopes for the authentication token by [@crazy-max](https://github.com/crazy-max) in [docker/login-action#912](https://redirect.github.com/docker/login-action/pull/912)
- Add support for AWS European Sovereign Cloud ECR by [@dphi](https://github.com/dphi) in [docker/login-action#914](https://redirect.github.com/docker/login-action/pull/914)
- Ensure passwords are redacted with `registry-auth` input by [@crazy-max](https://github.com/crazy-max) in …
* GH-48965: [Python][C++] Compare unique_ptr for CFlightResult or CFlightInfo to nullptr instead of NULL (#48968)
### Rationale for this change
Cython built code is currently failing to compile on free threaded wheels due to:
```
/arrow/python/build/temp.linux-x86_64-cpython-313t/_flight.cpp: In function ‘PyObject* __pyx_gb_7pyarrow_7_flight_12FlightClient_9do_action_2generator2(__pyx_CoroutineObject*, PyThreadState*, PyObject*)’:
/arrow/python/build/temp.linux-x86_64-cpython-313t/_flight.cpp:43068:110: error: call of overloaded ‘unique_ptr(NULL)’ is ambiguous
43068 | __pyx_t_3 = (__pyx_cur_scope->__pyx_v_result->result == ((std::unique_ptr< arrow::flight::Result> )NULL));
|
```
### What changes are included in this PR?
Update comparing `unique_ptr[CFlightResult]` and `unique_ptr[CFlightInfo]` from `NULL` to `nullptr`.
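The difference can be illustrated outside Cython. The following is a minimal sketch, assuming a hypothetical `Result` struct that stands in for `arrow::flight::Result`; it is not the generated Cython code. Casting `NULL` to a `std::unique_ptr` can be ambiguous because `NULL` is an integer-like null pointer constant that converts equally well to `std::nullptr_t` and to a raw pointer, whereas `nullptr` has its own type and selects a single overload.

```cpp
#include <memory>

// Hypothetical stand-in for arrow::flight::Result; used only for illustration.
struct Result {
  int value = 0;
};

// Returns true when the pointer owns nothing. Comparing against nullptr
// uses the unique_ptr/std::nullptr_t comparison, so no cast is needed.
bool IsEmpty(const std::unique_ptr<Result>& ptr) { return ptr == nullptr; }
```

A default-constructed `std::unique_ptr<Result>` is reported as empty, while one created with `std::make_unique<Result>()` is not.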
### Are these changes tested?
Yes via archery.
### Are there any user-facing changes?
No
* GitHub Issue: #48965
Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
* GH-48924: [C++][CI] Fix pre-buffering issues in IPC file reader (#48925)
### What changes are included in this PR?
Bug fixes and robustness improvements in the IPC file reader:
* Fix bug reading variadic buffers with pre-buffering enabled
* Fix bug reading dictionaries with pre-buffering enabled
* Validate IPC buffer offsets and lengths
Testing improvements:
* Exercise pre-buffering in IPC tests
* Actually exercise variadic buffers in IPC tests, by ensuring non-inline binary views are generated
* Run fuzz targets on golden IPC integration files in ASAN/UBSAN CI job
* Exercise pre-buffering in the IPC file fuzz target
Miscellaneous:
* Add convenience functions for integer overflow checking
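As an illustration of what validating buffer offsets and lengths with overflow checking can look like, here is a minimal sketch. The names `CheckedAdd` and `IsValidBufferRange` are hypothetical and are not the helpers added by this PR; the sketch only shows the general technique of rejecting a range whose end cannot be computed safely or lies past the end of the file.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical helper (not Arrow's API): adds two int64_t values and
// returns std::nullopt instead of overflowing.
std::optional<int64_t> CheckedAdd(int64_t a, int64_t b) {
  constexpr int64_t kMax = std::numeric_limits<int64_t>::max();
  constexpr int64_t kMin = std::numeric_limits<int64_t>::min();
  if (b > 0 && a > kMax - b) return std::nullopt;
  if (b < 0 && a < kMin - b) return std::nullopt;
  return a + b;
}

// Returns true if [offset, offset + length) lies within a file of
// `file_size` bytes. Negative inputs and overflowing sums are rejected.
bool IsValidBufferRange(int64_t offset, int64_t length, int64_t file_size) {
  if (offset < 0 || length < 0 || file_size < 0) return false;
  const std::optional<int64_t> end = CheckedAdd(offset, length);
  return end.has_value() && *end <= file_size;
}
```

Checking the sum before comparing it to the file size matters because a hostile or corrupt file can declare an offset close to the maximum `int64_t` value, and a plain `offset + length` would then be undefined behavior.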
### Are these changes tested?
Yes, by existing and improved tests.
### Are there any user-facing changes?
Bug fixes.
**This PR contains a "Critical Fix".** Fixes a potential crash reading variadic buffers with pre-buffering enabled.
* GitHub Issue: #48924
Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
* GH-48966: [C++] Fix cookie duplication in the Flight SQL ODBC driver and the Flight Client (#48967)
### Rationale for this change
The bug breaks a Flight SQL server that refreshes the auth token when cookie authentication is enabled.
### What changes are included in this PR?
1. In the ODBC layer, removed the code that adds a 2nd ClientCookieMiddlewareFactory in the client options (the 1st one is registered in `BuildFlightClientOptions`). This fixes the issue of the duplicate header cookie fields.
2. In the Flight client layer, use the case-insensitive equality comparator instead of the case-insensitive less-than comparator for the cookie cache, which is an unordered map. This fixes the issue of duplicate cookie keys.
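The second point can be sketched with the standard library alone. `std::unordered_map` takes a hash and a key-equality predicate; if a less-than comparator is supplied where the equality predicate belongs, two keys that differ only in case compare as "not equal" and both get stored. The types below (`CaseInsensitiveHash`, `CaseInsensitiveEqual`, `CookieCache`) are hypothetical names for illustration, not the ones used in the Flight client.

```cpp
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// Lower-cases ASCII characters so that hashing and equality agree.
std::string ToLowerAscii(const std::string& s) {
  std::string out = s;
  for (char& c : out) {
    c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
  }
  return out;
}

// Hash must be consistent with the equality predicate below.
struct CaseInsensitiveHash {
  std::size_t operator()(const std::string& key) const {
    return std::hash<std::string>{}(ToLowerAscii(key));
  }
};

// Equality (not less-than) is what an unordered container expects.
struct CaseInsensitiveEqual {
  bool operator()(const std::string& a, const std::string& b) const {
    return ToLowerAscii(a) == ToLowerAscii(b);
  }
};

using CookieCache = std::unordered_map<std::string, std::string,
                                       CaseInsensitiveHash, CaseInsensitiveEqual>;
```

With this cache, storing a value under `Session` and then under `session` leaves a single entry holding the newer value.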
### Are these changes tested?
Manually on Windows, and CI
### Are there any user-facing changes?
No
* GitHub Issue: #48966
Authored-by: jianfengmao <jianfengmao@deephaven.io>
Signed-off-by: David Li <li.davidm96@gmail.com>
* GH-48691: [C++][Parquet] Write serializer may crash if the value buffer is empty (#48692)
### Rationale for this change
WriteArrowSerialize could unconditionally read values from the Arrow array even for null rows. Since the caller may provide a zero-sized dummy buffer for all-null arrays, this caused an ASAN heap-buffer-overflow.
### What changes are included in this PR?
Check early that the array is not all-null before serializing it.
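A minimal sketch of this kind of guard follows, assuming a hypothetical `SimpleArray` type whose value buffer may be empty when every row is null. It is not the actual `WriteArrowSerialize` code; it only illustrates returning before the value buffer is touched.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified array: validity[i] is false for null rows.
// `values` has one slot per row, except that it may be empty when
// every row is null (the "zero-sized dummy buffer" case).
struct SimpleArray {
  std::vector<bool> validity;
  std::vector<int32_t> values;
};

// Widens values to int64_t. Null slots are written as 0.
std::vector<int64_t> SerializeValues(const SimpleArray& array) {
  std::vector<int64_t> out(array.validity.size(), 0);
  const bool all_null = std::none_of(array.validity.begin(), array.validity.end(),
                                     [](bool valid) { return valid; });
  // Early return: never read `values` when there is nothing valid to read.
  if (all_null) return out;
  for (std::size_t i = 0; i < array.validity.size(); ++i) {
    if (array.validity[i]) out[i] = array.values[i];
  }
  return out;
}
```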
### Are these changes tested?
Added tests.
### Are there any user-facing changes?
No
* GitHub Issue: #48691
Authored-by: rexan <rexan@apache.org>
Signed-off-by: Gang Wu <ustcwg@gmail.com>
* GH-48947 [CI][Python] Install pymanager.msi instead of pymanager.msix to fix docker rebuild on Windows wheels (#48948)
### Rationale for this change
As soon as we have to rebuild our Windows docker images they will fail installing python-manager-25.0.msix
### What changes are included in this PR?
- Use `pymanager.msi` to install python version instead of `pymanager.msix` which has problems on Docker.
- Update `pymanager install` command to use newer API (old command fails with missing flags)
- Update the default python command to use the suffix required for free-threaded builds when building free-threaded wheels
### Are these changes tested?
Yes via archery
### Are there any user-facing changes?
No
* GitHub Issue: #48947
Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
* GH-48990: [Ruby] Add support for writing date arrays (#48991)
### Rationale for this change
There are date32 and date64 variants for date arrays.
### What changes are included in this PR?
* Add `ArrowFormat::DateType#to_flatbuffers`
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* GitHub Issue: #48990
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-48992: [Ruby] Add support for writing large UTF-8 array (#48993)
### Rationale for this change
It's a large variant of UTF-8 array.
### What changes are included in this PR?
* Add `ArrowFormat::LargeUTF8Type#to_flatbuffers`
* Add support for large UTF-8 array of `#values` and `#raw_records`
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* GitHub Issue: #48992
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-48949: [C++][Parquet] Add Result versions for parquet::arrow::FileReader::ReadRowGroup(s) (#48982)
### Rationale for this change
`FileReader::ReadRowGroup(s)` previously returned `Status` and required callers to pass an `out` parameter.
### What changes are included in this PR?
Introduce `Result<std::shared_ptr<Table>>` returning APIs to allow clearer error propagation:
- Add new Result-returning `ReadRowGroup()` / `ReadRowGroups()` methods
- Deprecate the old Status/out-parameter overloads
- Update C++ callers and R/Python/GLib bindings to use the new API
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
Status versions of FileReader::ReadRowGroup(s) have been deprecated.
```cpp
virtual ::arrow::Status ReadRowGroup(int i, const std::vector<int>& column_indices,
                                     std::shared_ptr<::arrow::Table>* out);
virtual ::arrow::Status ReadRowGroup(int i, std::shared_ptr<::arrow::Table>* out);
virtual ::arrow::Status ReadRowGroups(const std::vector<int>& row_groups,
                                      const std::vector<int>& column_indices,
                                      std::shared_ptr<::arrow::Table>* out);
virtual ::arrow::Status ReadRowGroups(const std::vector<int>& row_groups,
                                      std::shared_ptr<::arrow::Table>* out);
```
* GitHub Issue: #48949
Lead-authored-by: fenfeng9 <fenfeng9@qq.com>
Co-authored-by: fenfeng9 <36840213+fenfeng9@users.noreply.github.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-48985: [GLib][Ruby] Fix GC problems in node options and expressions (#48989)
### Rationale for this change
Some node options and expressions do not keep a reference to their arguments. Without a reference, the arguments may be freed by the GC.
### What changes are included in this PR?
* Refer arguments of `garrow_filter_node_options_new()`
* Refer arguments of `garrow_project_node_options_new()`
* Refer arguments of `garrow_aggregate_node_options_new()`
* Refer arguments of `garrow_literal_expression_new()`
* Refer arguments of `garrow_call_expression_new()`
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* GitHub Issue: #48985
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-47692: [CI][Python] Do not fallback to return 404 if wheel is found on emscripten jobs (#49007)
### Rationale for this change
When looking for the wheel the script was falling back to returning a 404 even when the wheel was found:
```
+ python scripts/run_emscripten_tests.py dist/pyarrow-24.0.0.dev31-cp312-cp312-pyodide_2024_0_wasm32.whl --dist-dir=/pyodide --runtime=chrome
127.0.0.1 - - [27/Jan/2026 01:14:50] code 404, message File not found
```
This caused the job to time out and fail.
### What changes are included in this PR?
Correct logic and only return 404 if the file requested wasn't found.
### Are these changes tested?
Yes via archery
### Are there any user-facing changes?
No
* GitHub Issue: #47692
Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
* GH-48912: [R] Configure C++20 in conda R on continuous benchmarking (#48974)
### Rationale for this change
Benchmarks have been failing since the C++20 upgrade due to missing C++20 configuration.
### What changes are included in this PR?
Changes entirely from :robot: (Claude) with discussion from me regarding optimal approach.
Description as follows:
> conda-forge's R package doesn't have CXX20 configured in Makeconf, even though the compiler (gcc 14.3.0) supports C++20. This causes Arrow R package installation to fail with "a C++20 compiler is required" because `R CMD config CXX20` returns empty.
>
> This PR adds CXX20 configuration to R's Makeconf before building the Arrow R package in the benchmark hooks, if not already present.
### Are these changes tested?
I got :robot: to try it locally in a container but I'm not convinced we'll know for sure til we try it out properly.
> Tested in Docker container with Amazon Linux 2023 + conda-forge R - confirmed `R CMD config CXX20` returns empty before patch and `g++` after patch.
>
> The only thing we didn't test end-to-end was actually building Arrow R, but that would have taken much longer and the configure check (R CMD config CXX20 returning non-empty) is exactly what Arrow's configure script tests before proceeding.
### Are there any user-facing changes?
Nope
* GitHub Issue: #48912
Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
* GH-36889: [C++][Python] Fix duplicate CSV header when first batch is empty (#48718)
### Rationale for this change
Fixes https://github.com/apache/arrow/issues/36889
When writing CSV from a table where the first batch is empty, the header gets written twice:
```python
table = pa.table({"col1": ["a", "b", "c"]})
combined = pa.concat_tables([table.schema.empty_table(), table])
write_csv(combined, buf)
# Result: "col1"\n"col1"\n"a"\n"b"\n"c"\n <-- header appears twice
```
### What changes are included in this PR?
The bug happens because:
1. Header is written to `data_buffer_` and flushed during `CSVWriterImpl` initialization
2. The buffer is not cleared after flush
3. When the next batch is empty, `TranslateMinimalBatch` returns early without modifying `data_buffer_`
4. The write loop then writes `data_buffer_` which still contains stale content
The fix introduces a `WriteAndClearBuffer()` helper that writes the buffer to sink and clears it. This helper is used in all write paths:
- `WriteHeader()`
- `WriteRecordBatch()`
- `WriteTable()`
This ensures the buffer is always clean after any flush, making it impossible for stale content to be written again.
### Are these changes tested?
Yes. Added C++ tests in `writer_test.cc` and Python tests in `test_csv.py`:
- Empty batch at start of table
- Empty batch in middle of table
### Are there any user-facing changes?
No API changes. This is a bug fix that prevents duplicate headers when writing CSV from tables with empty batches.
* GitHub Issue: #36889
Lead-authored-by: Ruiyang Wang <ruiyang@anthropic.com>
Co-authored-by: Ruiyang Wang <56065503+rynewang@users.noreply.github.com>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Gang Wu <ustcwg@gmail.com>
* GH-48932: [C++][Packaging][FlightRPC] Fix `rsync` build error ODBC Nightly Package (#48933)
### Rationale for this change
#48932
### What changes are included in this PR?
- Fix `rsync` build error ODBC Nightly Package
### Are these changes tested?
- tested in CI
### Are there any user-facing changes?
- After fix, users should be able to get Nightly ODBC package release
* GitHub Issue: #48932
Authored-by: Alina (Xi) Li <alina.li@improving.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-48951: [Docs] Add documentation relating to AI tooling (#48952)
### Rationale for this change
Add guidance re AI tooling
### What changes are included in this PR?
Updates to main docs and links to it from new contributor's guide
### Are these changes tested?
No, but I'll build the docs.
### Are there any user-facing changes?
Just docs
:robot: Changes generated using Claude Code - I took the discussion from the mailing list, asked it to add the original text and then apply suggested changes one at a time, made a few of my own tweaks, and then instructed it to edit things down a bit for clarity and conciseness.
* GitHub Issue: #48951
Lead-authored-by: Nic Crane <thisisnic@gmail.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
* GH-49029: [Doc] Run sphinx-build in parallel (#49026)
### Rationale for this change
`sphinx-build` allows for parallel operation, but it builds serially by default and that can be very slow on our docs given the number of documents (many of them auto-generated from API docs).
### Are these changes tested?
By existing CI jobs.
### Are there any user-facing changes?
No.
* GitHub Issue: #49029
Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
* GH-33450: [C++] Remove GlobalForkSafeMutex (#49033)
### Rationale for this change
This functionality is unused now that we have a proper atfork facility.
### Are these changes tested?
By existing CI tests.
### Are there any user-facing changes?
Removing an API that was always meant for internal use (though we didn't flag it explicitly as internal).
* GitHub Issue: #33450
Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
* GH-35437: [C++] Remove obsolete TODO about DictionaryArray const& return types (#48956)
### Rationale for this change
The TODO comment in `vector_array_sort.cc` asking whether `DictionaryArray::dictionary()` and `DictionaryArray::indices()` should return `const&` is obsolete.
It was added in commit 6ceb12f700a when dictionary array sorting was implemented. At that time, these methods returned `std::shared_ptr<Array>` by value, causing unnecessary copies.
The issue was fixed in commit 95a8bfb319b, which changed both methods to return `const std::shared_ptr<Array>&`, removing the copies. However, the TODO comment was never removed.
### What changes are included in this PR?
Removed the outdated TODO comment that referenced GH-35437.
### Are these changes tested?
I did not test.
### Are there any user-facing changes?
No.
* GitHub Issue: #35437
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
* GH-48586: [Python][CI] Upload artifact to python-sdist job (#49008)
### Rationale for this change
When running the python-sdist job we are currently not uploading the build artifact to the job.
### What changes are included in this PR?
Upload artifact as part of building the job so it's easier to test and validate contents if necessary.
### Are these changes tested?
Yes via archery.
### Are there any user-facing changes?
No
* GitHub Issue: #48586
Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
* MINOR: [R] Add 22.0.0.1 to compatibility matrix (#49039)
### Rationale for this change
CI needs updating to test old R package versions
### What changes are included in this PR?
Add 22.0.0.1
### Are these changes tested?
Nah, it's CI stuff
### Are there any user-facing changes?
No
Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
* GH-48961: [Docs][Python] Doctest fails on pandas 3.0 (#48969)
### Rationale for this change
See issue #48961
Pandas 3.0.0 string storage type changes https://github.com/pandas-dev/pandas/pull/62118/changes
and https://pandas.pydata.org/docs/whatsnew/v3.0.0.html#dedicated-string-data-type-by-default
### What changes are included in this PR?
Updating several doctest examples from `string` to `large_string`.
### Are these changes tested?
Yes, locally.
### Are there any user-facing changes?
No.
Closes #48961
* GitHub Issue: #48961
Authored-by: Tadeja Kadunc <tadeja.kadunc@gmail.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>
* GH-49037: [Benchmarking] Install R from non-conda source for benchmarking (#49038)
### Rationale for this change
Slow benchmarks due to conda duckdb building from source
### What changes are included in this PR?
Try ditching conda and installing R via rig and using PPM binaries
### Are these changes tested?
I'll try running
### Are there any user-facing changes?
Nope
* GitHub Issue: #49037
Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
* GH-49042: [C++] Remove mimalloc patch (#49041)
### Rationale for this change
This patch was integrated upstream in https://github.com/microsoft/mimalloc/pull/1139
### Are these changes tested?
By existing CI.
### Are there any user-facing changes?
No.
* GitHub Issue: #49042
Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49024: [CI] Update Debian version in `.env` (#49032)
### Rationale for this change
Default Debian version in `.env` now maps to oldstable, we should use stable instead.
Also prune entries that are not used anymore.
### Are these changes tested?
By existing CI jobs.
### Are there any user-facing changes?
No.
* GitHub Issue: #49024
Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49027: [Ruby] Add support for writing time arrays (#49028)
### Rationale for this change
There are 32/64 bit and second/millisecond/microsecond/nanosecond variants for time arrays.
### What changes are included in this PR?
* Add `ArrowFormat::TimeType#to_flatbuffers`
* Add bit width information to `ArrowFormat::TimeType`
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* GitHub Issue: #49027
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49030: [Ruby] Add support for writing fixed size binary array (#49031)
### Rationale for this change
It's a fixed size variant of binary array.
### What changes are included in this PR?
* Add `ArrowFormat::FixedSizeBinaryType#to_flatbuffers`
* Add `ArrowFormat::FixedSizeBinaryArray#each_buffer`
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* GitHub Issue: #49030
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-48866: [C++][Gandiva] Truncate subseconds beyond milliseconds in `castTIMESTAMP_utf8` and `castTIME_utf8` (#48867)
### Rationale for this change
Fixes #48866. The Gandiva precompiled time functions `castTIMESTAMP_utf8` and `castTIME_utf8` currently reject timestamp and time string literals with more than 3 subsecond digits (beyond millisecond precision), throwing an "Invalid millis" error. This behavior is inconsistent with other implementations.
### What changes are included in this PR?
- Fixed `castTIMESTAMP_utf8` and `castTIME_utf8` functions to truncate subseconds beyond 3 digits instead of throwing an error
- Updated tests. Replaced error-expecting tests with truncation verification tests and added edge cases
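The truncation rule described above can be sketched as follows. `FractionToMillis` is a hypothetical helper, not the Gandiva implementation; it assumes the caller has already isolated the digits after the decimal point of the seconds field.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: converts fractional-second digits to milliseconds,
// truncating (not rounding) anything past the third digit.
// Returns -1 if a non-digit character is encountered.
int FractionToMillis(const std::string& fraction_digits) {
  int millis = 0;
  int digits_used = 0;
  for (char c : fraction_digits) {
    if (!std::isdigit(static_cast<unsigned char>(c))) return -1;
    if (digits_used < 3) {
      millis = millis * 10 + (c - '0');
      ++digits_used;
    }
    // Digits beyond the third are validated but otherwise ignored.
  }
  // Scale short fractions: "5" means 500 ms, "12" means 120 ms.
  for (; digits_used < 3; ++digits_used) millis *= 10;
  return millis;
}
```

Under this sketch, the fraction `123456` yields 123 milliseconds instead of an error, and `999999999` yields 999 rather than rounding up into the next second.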
### Are these changes tested?
Yes
### Are there any user-facing changes?
No
* GitHub Issue: #48866
Authored-by: Arkadii Kravchuk <arkadii.kravchuk@dremio.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-48673: [C++] Fix ToStringWithoutContextLines to check for :\d+ pattern before removing lines (#48674)
### Rationale for this change
This PR proposes to fix the TODO at https://github.com/apache/arrow/blob/7ebc88c8fae62ed97bc30865c845c8061132af7e/cpp/src/arrow/status.cc#L131-L134, which would allow better parsing of line numbers.
I could not find the relevant example to demonstrate within this project but assume that we have a test such as:
(Generated by ChatGPT)
```cpp
TEST(BlockParser, ErrorMessageWithColonsPreserved) {
  Status st(StatusCode::Invalid,
            "CSV parse error: Row #2: Expected 2 columns, got 3: 12:34:56,key:value,data\n"
            "Error details: Time format: 12:34:56, Key: value\n"
            "parser_test.cc:940 Parse(parser, csv, &out_size)");
  std::string expected_msg =
      "Invalid: CSV parse error: Row #2: Expected 2 columns, got 3: 12:34:56,key:value,data\n"
      "Error details: Time format: 12:34:56, Key: value";
  ASSERT_RAISES_WITH_MESSAGE(Invalid, expected_msg, st);
}

// Test with URL-like data (another common case with colons)
TEST(BlockParser, ErrorMessageWithURLPreserved) {
  Status st(StatusCode::Invalid,
            "CSV parse error: Row #2: Expected 1 columns, got 2: http://arrow.apache.org:8080/api,data\n"
            "URL: http://arrow.apache.org:8080/api\n"
            "parser_test.cc:974 Parse(parser, csv, &out_size)");
  std::string expected_msg =
      "Invalid: CSV parse error: Row #2: Expected 1 columns, got 2: http://arrow.apache.org:8080/api,data\n"
      "URL: http://arrow.apache.org:8080/api";
  ASSERT_RAISES_WITH_MESSAGE(Invalid, expected_msg, st);
}
```
then it fails.
### What changes are included in this PR?
Fixed `Status::ToStringWithoutContextLines()` to only remove context lines matching the `filename:line` pattern (`:\d+`), preventing legitimate error messages containing colons from being incorrectly stripped.
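A rough sketch of this kind of filtering is shown below. It is not the code in `status.cc`: the function names are hypothetical, the assumed shape of a context line is `<file>:<line>` followed by whitespace and an expression, and the sketch filters line by line as a simplification.

```cpp
#include <regex>
#include <sstream>
#include <string>

// Assumption: a context line starts with a whitespace-free token ending in
// ":<digits>", followed by whitespace (for example "parser_test.cc:940 ...").
bool LooksLikeContextLine(const std::string& line) {
  static const std::regex kPattern(R"(^\S+:\d+\s.*$)");
  return std::regex_match(line, kPattern);
}

// Drops lines that look like context lines and keeps everything else,
// including message lines that merely contain colons.
std::string StripContextLines(const std::string& message) {
  std::istringstream in(message);
  std::string line;
  std::string out;
  bool first = true;
  while (std::getline(in, line)) {
    if (LooksLikeContextLine(line)) continue;
    if (!first) out += '\n';
    out += line;
    first = false;
  }
  return out;
}
```

Lines such as `URL: http://arrow.apache.org:8080/api` are preserved because the token before the first whitespace does not end in a colon followed by digits.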
### Are these changes tested?
Manually tested, and unittests were added, with `cmake .. --preset ninja-debug -DARROW_EXTRA_ERROR_CONTEXT=ON`.
### Are there any user-facing changes?
No, test-only.
* GitHub Issue: #48673
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49044: [CI][Python] Fix test_download_tzdata_on_windows by adding required user-agent on urllib request (#49052)
### Rationale for this change
See: #49044
### What changes are included in this PR?
The urllib request is now sent with `"user-agent": "pyarrow"`.
### Are these changes tested?
It's a CI fix.
### Are there any user-facing changes?
No, just a CI test fix.
* GitHub Issue: #49044
Authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
* GH-48983: [Packaging][Python] Build wheel from sdist using build and add check to validate LICENSE.txt and NOTICE.txt are part of the wheel contents (#48988)
### Rationale for this change
Currently the files are missing from the published wheels.
### What changes are included in this PR?
- Ensure the license and notice files are part of the wheels
- Use build frontend to build wheels
- Build wheel from sdist
### Are these changes tested?
Yes, via archery.
I've validated all wheels will fail with the new check if LICENSE.txt or NOTICE.txt are missing:
```
AssertionError: LICENSE.txt is missing from the wheel.
```
### Are there any user-facing changes?
No
* GitHub Issue: #48983
Lead-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
* GH-49059: [C++] Fix issues found by OSS-Fuzz in IPC reader (#49060)
### Rationale for this change
Fix two issues found by OSS-Fuzz in the IPC reader:
* a controlled abort on invalid IPC metadata: https://oss-fuzz.com/testcase-detail/5301064831401984
* a nullptr dereference on invalid IPC metadata: https://oss-fuzz.com/testcase-detail/5091511766417408
Neither of these two issues is a security issue.
### Are these changes tested?
Yes, by new unit tests and new fuzz regression files.
### Are there any user-facing changes?
No.
**This PR contains a "Critical Fix".**
* GitHub Issue: #49059
Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
* GH-49055: [Ruby] Add support for writing decimal128/256 arrays (#49056)
### Rationale for this change
Only decimal128/256 arrays are supported.
### What changes are included in this PR?
Add `ArrowFormat::DecimalType#to_flatbuffers`.
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* GitHub Issue: #49055
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49053: [Ruby] Add support for writing timestamp array (#49054)
### Rationale for this change
It has `unit` and `time_zone` parameters.
### What changes are included in this PR?
* Add `ArrowFormat::TimestampType#to_flatbuffers`
* Set time zone when GLib timestamp type is converted from C++ timestamp type
* Use `time_zone` not `timezone`
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* GitHub Issue: #49053
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-28859: [Doc][Python] Use only code-block directive and set up doctest for the python user guide (#48619)
### Rationale for this change
In many places in the Python User Guide the code examples are written with the IPython directive (elsewhere code-block is used). IPython directives are converted to IPython format (`In` and `Out`) during the doc build. This can lead to slower builds.
### What changes are included in this PR?
IPython directives are converted to runnable code-block (with `>>>` and `...`) and pytest doctest support for `.rst` files is added to the `conda-python-docs` CI job. This means the code in the Python User Guide is tested separately to the building of the documentation.
### Are these changes tested?
Yes, with the CI.
### Are there any user-facing changes?
Changes to the Python User Guide examples will have to be tested with `pytest --doctest-glob='*.rst' docs/source/python/file.rst`
* GitHub Issue: #28859
Lead-authored-by: AlenkaF <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: tadeja <tadeja@users.noreply.github.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>
* GH-49065: [C++] Remove unnecessary copies of shared_ptr in Type::BOOL and Type::NA at GrouperImpl (#49066)
### Rationale for this change
The grouper code was creating a `shared_ptr<DataType>` for every key type, even when it wasn't needed. This resulted in unnecessary reference counting operations. For example, `BooleanKeyEncoder` and `NullKeyEncoder` don't require a `shared_ptr` in their constructors, yet we were creating one for every key of those types.
### What changes are included in this PR?
Changed `GrouperImpl::Make()` to use `TypeHolder` references directly and only call `GetSharedPtr()` when needed by encoder constructors. This eliminates `shared_ptr` creation for `Type::BOOL` and `Type::NA` cases. Other encoder types (dictionary, fixed-width, binary) still require `shared_ptr` since their constructors take `shared_ptr<DataType>` parameters for ownership.
### Are these changes tested?
Yes, existing tests.
### Are there any user-facing changes?
No.
* GitHub Issue: #49065
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-48159 [C++][Gandiva] Projector make is significantly slower after move to OrcJIT (#49063)
### Rationale for this change
Reduces LLVM TargetMachine object creation from 3 to 1. This object is expensive to create and the extra copies weren't needed.
### What changes are included in this PR?
Refactor the Engine class to only create one target machine and pass that to the necessary functions.
Before the change (3 TargetMachines created):
First TargetMachine: In Engine::Make(), MakeTargetMachineBuilder() is called, then BuildJIT() is called. Inside LLJITBuilder::create(), when prepareForConstruction() runs, if no DataLayout was set, it calls JTMB->getDefaultDataLayoutForTarget() which creates a temporary TargetMachine just to get the DataLayout.
Second TargetMachine: Inside BuildJIT(), when setCompileFunctionCreator is used with the lambda, that lambda calls JTMB.createTargetMachine() to create a TargetMachine for the TMOwningSimpleCompiler.
Third TargetMachine: Back in Engine::Make(), after BuildJIT() returns, there's an explicit call to jtmb.createTargetMachine() to create target_machine_ for the Engine.
After the change (1 TargetMachine created):
The key changes are:
Create TargetMachine first: The code now creates the TargetMachine explicitly at the start of Engine::Make. That machine is passed to BuildJIT. In BuildJIT, that machine's DataLayout is passed to LLJITBuilder, which prevents prepareForConstruction() from calling getDefaultDataLayoutForTarget() (which would create a temporary TargetMachine).
Use SimpleCompiler instead of TMOwningSimpleCompiler:
SimpleCompiler takes a reference to an existing TargetMachine rather than owning one, so no new TargetMachine is created.
A shared_ptr is used to ensure that TargetMachine stays around for the lifetime of the LLJIT instance.
### Are these changes tested?
Yes, unit and integration.
### Are there any user-facing changes?
No.
* GitHub Issue: #48159
Lead-authored-by: logan.riggs@gmail.com <logan.riggs@gmail.com>
Co-authored-by: Logan Riggs <logan.riggs@dremio.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49043: [C++][FS][Azure] Avoid bugs caused by empty first page(s) followed by non-empty subsequent page(s) (#49049)
### Rationale for this change
Prevent bugs similar to https://github.com/apache/arrow/issues/49043
### What changes are included in this PR?
- Implement `SkipStartingEmptyPages` for various types of PagedResponses used in the `AzureFileSystem`.
- Apply `SkipStartingEmptyPages` on the response from every list operation that returns a paged response.
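The idea behind skipping leading empty pages can be sketched in a few lines. This is a minimal Python model, not the C++ code from the PR: the function name `skip_starting_empty_pages` and the list-of-lists page representation are assumptions made for illustration, and the real Azure SDK paged responses are advanced differently.

```python
from typing import Iterable, Iterator, List


def skip_starting_empty_pages(pages: Iterable[List[str]]) -> Iterator[List[str]]:
    """Yield pages, dropping the empty ones that precede the first non-empty page."""
    seen_non_empty = False
    for page in pages:
        if not seen_non_empty and not page:
            # A leading empty page must not be mistaken for "no results at all".
            continue
        seen_non_empty = True
        yield page


# Hypothetical listing: two empty pages followed by real entries.
pages = [[], [], ["a.parquet", "b.parquet"], [], ["c.parquet"]]
result = list(skip_starting_empty_pages(pages))
```

Only the empty pages at the start are dropped; an empty page in the middle of the listing is passed through unchanged, so callers that check "is the first page empty?" get a meaningful answer.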
### Are these changes tested?
Ran the tests in the codebase including the ones that need to connect to real blob storage. This makes me fairly confident that I haven't introduced a regression.
The only reproduction I've found involves reading a production Azure blob storage account. With it I've verified that this PR solves https://github.com/apache/arrow/issues/49043, but I haven't been able to reproduce the issue in any checked-in tests. I tried copying a chunk of data around our production reproduction into Azurite, but still couldn't reproduce it.
### Are there any user-facing changes?
Some low probability bugs will be gone. No interface changes.
* GitHub Issue: #49043
Authored-by: Thomas Newton <thomas.w.newton@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49034 [C++][Gandiva] Fix binary_string to not trigger error for null strings (#49035)
### Rationale for this change
The binary_string function will attempt to allocate 0 bytes of memory, which results in a null ptr being returned and the function interprets that as an error.
### What changes are included in this PR?
Add kCanReturnErrors to the function definition to match other string functions.
Move the check for 0 byte length input earlier in the binary_string function to prevent the 0 allocation.
Add a unit test.
### Are these changes tested?
Yes, unit and integration testing.
### Are there any user-facing changes?
No.
* GitHub Issue: #49034
Authored-by: Logan Riggs <logan.riggs@dremio.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-48980: [C++] Use COMPILE_OPTIONS instead of deprecated COMPILE_FLAGS (#48981)
### Rationale for this change
Arrow requires CMake 3.25 but was still using the deprecated `COMPILE_FLAGS` property. It is recommended to use `COMPILE_OPTIONS` (introduced in CMake 3.11) instead.
### What changes are included in this PR?
Replaced `COMPILE_FLAGS` with `COMPILE_OPTIONS` across `CMakeLists.txt` files, converted space separated strings to semicolon-separated lists, and removed obsolete TODO comments.
### Are these changes tested?
Yes, through CI build and existing tests.
### Are there any user-facing changes?
No.
* GitHub Issue: #48980
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49069: [C++] Share Trie instances across CSV value decoders (#49070)
### Rationale for this change
The CSV converter was building identical Trie data structures (for null/true/false values) in every decoder instance, causing duplicate memory allocation and initialization overhead.
### What changes are included in this PR?
- Introduced `TrieCache` struct to hold shared Trie instances (null_trie, true_trie, false_trie)
- Updated `ValueDecoder` and all decoder subclasses to accept and reference a shared `TrieCache` instead of building their own Tries
- Updated `Converter` base class to create one `TrieCache` per converter and pass it to all decoders
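The sharing pattern described above can be modelled roughly as follows. This is a Python sketch under stated assumptions, not the C++ implementation: `frozenset` stands in for the real Trie, and the constructor signatures are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class TrieCache:
    # frozensets stand in for the real Trie structures (assumption).
    null_trie: FrozenSet[str]
    true_trie: FrozenSet[str]
    false_trie: FrozenSet[str]


class ValueDecoder:
    def __init__(self, cache: TrieCache) -> None:
        # Keep a reference to the shared cache instead of rebuilding lookups.
        self.cache = cache

    def is_null(self, value: str) -> bool:
        return value in self.cache.null_trie


class Converter:
    def __init__(
        self,
        null_values: Iterable[str],
        true_values: Iterable[str],
        false_values: Iterable[str],
        n_decoders: int,
    ) -> None:
        # Build the lookup structures once per converter...
        self.cache = TrieCache(
            frozenset(null_values), frozenset(true_values), frozenset(false_values)
        )
        # ...and hand the same instance to every decoder.
        self.decoders: List[ValueDecoder] = [
            ValueDecoder(self.cache) for _ in range(n_decoders)
        ]


conv = Converter(["", "NA", "null"], ["true"], ["false"], n_decoders=3)
```

Every decoder holds the same `TrieCache` object, so the construction cost is paid once per converter rather than once per decoder.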
### Are these changes tested?
Yes, all existing tests. I ran a simple benchmark showing roughly 2-4% faster converter creation, and obviously less memory usage.
### Are there any user-facing changes?
No.
* GitHub Issue: #49069
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49076: [CI] Update vcpkg baseline to newer version (#49062)
### Rationale for this change
The current version of vcpkg used is from April 2025.
### What changes are included in this PR?
Update baseline to newer version.
### Are these changes tested?
Yes on CI. I've validated for example that xsimd 14 will be pulled.
### Are there any user-facing changes?
No
* GitHub Issue: #49076
Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49074: [Ruby] Add support for writing interval arrays (#49075)
### Rationale for this change
There are year month/day time/month day nano variants.
### What changes are included in this PR?
* Add `ArrowFormat::IntervalType#to_flatbuffers`
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* GitHub Issue: #49074
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49071: [Ruby] Add support for writing list and large list arrays (#49072)
### Rationale for this change
They use different offset size.
### What changes are included in this PR?
* Add `ArrowFormat::ListType#to_flatbuffers`
* Add `ArrowFormat::LargeListType#to_flatbuffers`
* Add `ArrowFormat::VariableSizeListArray#child`
* Add `ArrowFormat::VariableSizeListArray#each_buffer`
* `garrow_array_get_null_bitmap()` returns `NULL` when null bitmap doesn't exist
* Add `garrow_list_array_get_value_offsets_buffer()`
* Add `garrow_large_list_array_get_value_offsets_buffer()`
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* GitHub Issue: #49071
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49087 [CI][Packaging][Gandiva] Add support for LLVM 15 or earlier again (#49091)
### Rationale for this change
LLVM 15 or earlier uses `llvm::Optional` not `std::optional`.
### What changes are included in this PR?
Use `llvm::Optional` with LLVM 15 or earlier.
### Are these changes tested?
Yes, compiling.
### Are there any user-facing changes?
No
* GitHub Issue: #49087
Authored-by: logan.riggs@gmail.com <logan.riggs@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49100: [Docs] Broken link to Swift page in implementations.rst (#49101)
### Rationale for this change
The Swift documentation link in the implementations.rst file was broken and returned a 404 error.
### What changes are included in this PR?
Updated the Swift documentation link in https://github.com/apache/arrow/blob/235841d644d5454f7067c44f580f301446ba1cc0/docs/source/implementations.rst?plain=1#L124 from the [broken GitHub README link](https://github.com/apache/arrow-swift/blob/main/Arrow/README.md) to the [Swift Package documentation](https://swiftpackageindex.com/apache/arrow-swift/main/documentation/arrow)
### Are these changes tested?
Yes.
### Are there any user-facing changes?
No.
* GitHub Issue: #49100
Lead-authored-by: ChiLin Chiu <chilin.chiou@gmail.com>
Co-authored-by: Chilin <chilin.cs07@nycu.edu.tw>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49096: [Ruby] Add support for writing struct array (#49097)
### Rationale for this change
It's a nested array.
### What changes are included in this PR?
* Add `ArrowFormat::StructType#to_flatbuffers`
* Add `ArrowFormat::StructArray#each_buffer`
* Add `ArrowFormat::StructArray#children`
* Fix `ArrowFormat::Array#n_nulls`
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* GitHub Issue: #49096
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49093: [Ruby] Add support for writing duration array (#49094)
### Rationale for this change
It has unit parameter.
### What changes are included in this PR?
* Add `ArrowFormat::DurationType#to_flatbuffers`
* Add duration support to `#values` and `raw_records`
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* GitHub Issue: #49093
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49098: [Packaging][deb] Add missing libarrow-cuda-glib-doc (#49099)
### Rationale for this change
Documents for libarrow-cuda-glib are generated but they aren't packaged.
### What changes are included in this PR?
Package documents for libarrow-cuda-glib.
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* GitHub Issue: #49098
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-48764: [C++] Update xsimd (#48765)
### Rationale for this change
Homogenized versions used
### What changes are included in this PR?
Move to xsimd 14 to benefit from its latest improvements, which are relevant for improving the integer unpacking routines.
### Are these changes tested?
Yes, with current CI.
In fact due to the absence of pin, part of the CI already runs xsimd 14.
### Are there any user-facing changes?
No.
* GitHub Issue: #48764
Authored-by: AntoinePrv <AntoinePrv@users.noreply.github.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
* GH-46008: [Python][Benchmarking] Remove unused asv benchmarking files (#49047)
### Rationale for this change
As discussed on the issue, we don't seem to have run asv benchmarks on Python for the last few years. They are probably broken.
### What changes are included in this PR?
Remove asv benchmarking related files and docs.
### Are these changes tested?
No. Check CI and run preview-docs to validate the docs.
### Are there any user-facing changes?
No
* GitHub Issue: #46008
Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
* GH-49108: [Python] SparseCOOTensor.__repr__ missing f-string prefix (#49109)
### Rationale for this change
`SparseCOOTensor.__repr__` outputs literal `{self.type}` and `{self.shape}` instead of actual values due to missing f-string prefix.
### What changes are included in this PR?
Add f prefix to the string in `SparseCOOTensor.__repr__`.
### Are these changes tested?
Yes, it works after adding the f-string prefix:
```python3
>>> import pyarrow as pa
>>> import numpy as np
>>> dense_tensor = np.array([[0, 1, 0], [2, 0, 3]], dtype=np.float32)
>>> sparse_coo = pa.SparseCOOTensor.from_dense_numpy(dense_tensor)
>>> sparse_coo
<pyarrow.SparseCOOTensor>
type: float
shape: (2, 3)
```
### Are there any user-facing changes?
Yes, this fixes a bug that caused incorrect or invalid data to be produced. Before the fix:
```python3
>>> import pyarrow as pa
>>> import numpy as np
>>> dense_tensor = np.array([[0, 1, 0], [2, 0, 3]], dtype=np.float32)
>>> sparse_coo = pa.SparseCOOTensor.from_dense_numpy(dense_tensor)
>>> sparse_coo
<pyarrow.SparseCOOTensor>
type: {self.type}
shape: {self.shape}
```
* GitHub Issue: #49108
Authored-by: Chilin <chilin.cs07@nycu.edu.tw>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
* GH-49083: [CI][Python] Remove dask-contrib/dask-expr from the nightly dask test builds (#49126)
### Rationale for this change
Failing nightly job for dask (test-conda-python-3.11-dask-upstream_devel).
### What changes are included in this PR?
Removal of dask-contrib/dask-expr package as it is included in the dask dataframe module since January 2025.
### Are these changes tested?
Yes, with the extended dask build.
### Are there any user-facing changes?
No.
* GitHub Issue: #49083
Authored-by: AlenkaF <frim.alenka@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
* GH-49117: [Ruby] Add support for writing union arrays (#49118)
### Rationale for this change
There are dense and sparse variants.
### What changes are included in this PR?
* Add `garrow_union_array_get_n_fields()`
* Add `ArrowFormat::UnionArray#children`
* Add `ArrowFormat::DenseUnionArray#each_buffer`
* Add `ArrowFormat::SparseUnionArray#each_buffer`
* Add `ArrowFormat::UnionType#to_flatbuffers`
* Add `Arrow::UnionArray#fields`
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* GitHub Issue: #49117
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49119: [Ruby] Add support for writing map array (#49120)
### Rationale for this change
It's a list based array.
### What changes are included in this PR?
* Add `ArrowFormat::MapType#to_flatbuffers`
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* GitHub Issue: #49119
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-48922: [C++] Support Status-returning callables in Result::Map (#49127)
### Rationale for this change
Currently, Result::Map fails to compile when the mapping function returns a Status because it tries to instantiate `Result<Status>`, which is prohibited. This change allows Map to return Status directly in such cases.
### What changes are included in this PR?
- Added EnsureResult specialization to allow Map to return Status directly.
- Added unit tests to verify success/error propagation and return type resolution.
### Are these changes tested?
Yes.
### Are there any user-facing changes?
No
* GitHub Issue: #48922
Authored-by: Abhishek Bansal <abhibansal593@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
* GH-49003: [C++] Don't consider `out_of_range` an error in float parsing (#49095)
### Rationale for this change
This PR restores the behavior previous to version 23 for floating-point parsing on overflow and subnormal.
`fast_float` didn't assign an error code on overflow in version `3.10.1` and assigned `±Inf` on overflow and `0.0` on subnormal. With the update to version `8.1`, it started to assign `std::errc::result_out_of_range` in such cases.
### What changes are included in this PR?
Ignore `std::errc::result_out_of_range` and produce `±Inf` / `0.0` as appropriate instead of failing the conversion.
### Are these changes tested?
Yes. Created tests for overflow with positive and negative signed mantissa, and also created tests for subnormal, all of them for binary{16,32,64}.
### Are there any user-facing changes?
Yes. In `libarrow==23`, the CSV reader inferred such values as strings, whereas earlier versions parsed them as `0` or `±inf`.
With this patch, the CSV reader in PyArrow outputs:
```python
>>> import pyarrow
>>> import pyarrow.csv
>>> import io
>>> table = pyarrow.csv.read_csv(io.BytesIO(f"data\n10E-617\n10E617\n-10E617".encode()))
>>> print(table)
pyarrow.Table
data: double
----
data: [[0,inf,-inf]]
```
Closes #49003
* GitHub Issue: #49003
Authored-by: Alvaro-Kothe <kothe65@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
* GH-48941: [C++] Generate proper UTF-8 strings in JSON test utilities (#48943)
### Rationale for this change
The JSON test utility `GenerateAscii` was only generating ASCII characters. The tests should also cover proper UTF-8 and Unicode handling.
### What changes are included in this PR?
Replaced ASCII-only generation with proper UTF-8 string generation that produces valid Unicode scalar values across all planes (BMP, SMP, SIP, planes 3-16), correctly encoded per RFC 3629.
Added that function as a utility.
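As a rough illustration of what generating valid UTF-8 involves, the sketch below picks random Unicode scalar values (any code point except the surrogate range) and encodes them by hand using the RFC 3629 bit layout. It is a Python model written for this note, not the C++ test utility; all function names here are hypothetical.

```python
import random

SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF
MAX_CODE_POINT = 0x10FFFF


def random_scalar_value(rng: random.Random) -> int:
    """Pick a Unicode scalar value: any code point outside the surrogate range."""
    while True:
        cp = rng.randint(0, MAX_CODE_POINT)
        if not (SURROGATE_FIRST <= cp <= SURROGATE_LAST):
            return cp


def encode_utf8(cp: int) -> bytes:
    """Encode one scalar value using the 1- to 4-byte forms of RFC 3629."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes(
            [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
        )
    return bytes(
        [
            0xF0 | (cp >> 18),
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ]
    )


def generate_utf8(rng: random.Random, n_chars: int) -> bytes:
    """Concatenate the encodings of n_chars random scalar values."""
    return b"".join(encode_utf8(random_scalar_value(rng)) for _ in range(n_chars))


data = generate_utf8(random.Random(0), 200)
```

Excluding surrogates is what keeps the output valid: a strict UTF-8 decoder rejects encoded surrogate code points.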
### Are these changes tested?
There are existing tests for JSON.
### Are there any user-facing changes?
No, test-only.
* GitHub Issue: #48941
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
* GH-49067: [R] Disable GCS on macos (#49068)
### Rationale for this change
Builds that complete on CRAN
### What changes are included in this PR?
Disable GCS by default
### Are these changes tested?
### Are there any user-facing changes?
Hopefully not
* GitHub Issue: #49067
---------
Co-authored-by: Nic Crane <thisisnic@gmail.com>
* GH-49115: [CI][Packaging][Python] Update vcpkg baseline for our wheels (#49116)
### Rationale for this change
Wheels currently fail to build because the old version of vcpkg fails with our latest main.
### What changes are included in this PR?
- Update vcpkg version.
- Update patches
- Add `perl-Time-Piece` to some images as required to build newer OpenSSL.
### Are these changes tested?
Yes on CI
### Are there any user-facing changes?
No
* GitHub Issue: #49115
Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-48954: [C++] Add test for null-type dictionary sorting and clarify XXX comment (#48955)
### Rationale for this change
Null-type dictionaries (e.g., `dictionary(int8(), null())`) are valid Arrow constructs supported from day one, but the sorting code had an uncertain `XXX Should this support Type::NA?` comment. We should explicitly support and test this because other functions already support this:
```python
import pyarrow as pa
import pyarrow.compute as pc
pc.array_sort_indices(pa.array([None, None, None, None], type=pa.int32()))
# [0, 1, 2, 3]
pc.array_sort_indices(pa.DictionaryArray.from_arrays(
indices=pa.array([None, None, None, None], type=pa.int8()),
dictionary=pa.array([], type=pa.null())
))
# [0, 1, 2, 3]
```
I believe it does not make sense to specifically disallow this in dictionaries at this point.
### What changes are included in this PR?
Added a unittest for null sorting behaviour.
### Are these changes tested?
Yes, the unittest was added.
### Are there any user-facing changes?
No.
* GitHub Issue: #48954
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
* GH-36193: [R] arm64 binaries for R (#48574)
### Rationale for this change
Issues building on ARM
### What changes are included in this PR?
CI job and nixlibs update
### Are these changes tested?
On CI
### Are there any user-facing changes?
No
AI changes :robot:: Claude decided where to make the changes and helped debug failing builds, but I updated most of it (e.g. rstudio -> posit, choice of runners etc)
* GitHub Issue: #36193
Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
* GH-48397: [R] Update docs on how to get our libarrow builds (#48995)
### Rationale for this change
Turning off GCS on CRAN to prevent excessive build times, need to tell people who wanna work with GCS how to do that.
### What changes are included in this PR?
Update docs.
### Are these changes tested?
Will preview docs build.
### Are there any user-facing changes?
Just docs.
* GitHub Issue: #48397
Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
* GH-49104: [C++] Fix Segfault in SparseCSFIndex::Equals with mismatched dimensions (#49105)
### Rationale for This Change
The `SparseCSFIndex::Equals` method can crash when comparing two sparse indices that have a different number of dimensions. The method iterates over the `indices()` and `indptr()` vectors of the current object and accesses the corresponding elements in the `other` object without first verifying that both objects have matching vector sizes. This can lead to out-of-bounds access and a segmentation fault when the dimension counts differ.
### What Changes Are Included in This PR?
This change adds explicit size equality checks for the `indices()` and `indptr()` vectors at the beginning of the `SparseCSFIndex::Equals` method. If the dimensions do not match, the method now safely returns `false` instead of attempting invalid memory access.
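The guard-then-compare shape of the fix can be sketched as below. This is a Python model with plain lists standing in for the index tensors; the function name and parameters are assumptions. In Python the unguarded loop would raise `IndexError` rather than segfault, but the ordering of the checks is the same.

```python
from typing import List, Sequence


def csf_index_equals(
    indices_a: Sequence[List[int]],
    indptr_a: Sequence[List[int]],
    indices_b: Sequence[List[int]],
    indptr_b: Sequence[List[int]],
) -> bool:
    # Guard first: differing vector counts mean differing dimensions.
    if len(indices_a) != len(indices_b):
        return False
    if len(indptr_a) != len(indptr_b):
        return False
    # Only now is it safe to index both sides with the same position.
    for i in range(len(indices_a)):
        if indices_a[i] != indices_b[i]:
            return False
    for i in range(len(indptr_a)):
        if indptr_a[i] != indptr_b[i]:
            return False
    return True


same = csf_index_equals([[0, 1], [0, 1, 2]], [[0, 2]], [[0, 1], [0, 1, 2]], [[0, 2]])
mismatched = csf_index_equals([[0, 1], [0, 1, 2]], [[0, 2]], [[0, 1]], [])
```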
### Are These Changes Tested?
Yes. The fix has been validated through targeted reproduction of the crash scenario using mismatched dimension counts, ensuring the method behaves safely and deterministically.
### Are There Any User-Facing Changes?
No. This change improves internal safety and robustness without altering public APIs or observable user behavior.
* GitHub Issue: #49104
Lead-authored-by: Alirana2829 <alimahmoodrana00@gmail.com>
Co-authored-by: Ali Mahmood Rana <159713825+AliRana30@users.noreply.github.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Rok Mihevc <rok@mihevc.org>
* MINOR: [Docs] Add links to AI-generated code guidance (#49131)
### Rationale for this change
Add link to AI-generated code guidance - we should make sure the docs are updated before we merge this though
### What changes are included in this PR?
Add link to AI-generated code guidance
### Are these changes tested?
No
### Are there any user-facing changes?
No
Lead-authored-by: Nic Crane <thisisnic@gmail.com>
Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
* MINOR: [R] Add new vignette to pkgdown config (#49145)
### Rationale for this change
CI failing on preview-docs; see #49141
### What changes are included in this PR?
Add the vignette created in #49068 to pkgdown config
### Are these changes tested?
I'll trigger CI
### Are there any user-facing changes?
Nah
Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
* GH-49150: [Doc][CI][Python] Doctests failing on rst files due to pandas 3+ (#49088)
Fixes: #49150
See https://github.com/apache/arrow/pull/48619#issuecomment-3823269381
### Rationale for this change
Fix CI failures
### What changes are included in this PR?
Tests are made more general to allow for Pandas 2 and Pandas 3 style string types
### Are these changes tested?
By CI
### Are there any user-facing changes?
No
* GitHub Issue: #49150
Authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Rok Mihevc <rok@mihevc.org>
* GH-41990: [C++] Fix AzureFileSystem compilation on Windows (#48971)
Let me preface this pull request by saying that I have not worked in C++ in quite a while. Apologies if this is missing modern idioms or is an obtuse fix.
### Rationale for this change
I encountered an issue trying to compile the AzureFileSystem backend in C++ on Windows. Searching the issue tracker, it appears this is already a [known](https://github.com/apache/arrow/issues/41990) but unresolved problem. This is an attempt to either address the issue or move the conversation forward for someone more experienced.
### What changes are included in this PR?
AzureFileSystem uses `unique_ptr` while the other cloud file system implementations rely on `shared_ptr`. Since this is a forward-declared Impl in the header file but the destructor was defined inline (via `= default`), we're getting compilation issues with MSVC because it requires the complete type earlier than GCC/Clang do.
This change removes the defaulted definition from the header file and moves it into the .cc file where we have a complete type.
Unrelated, I've also wrapped 2 exception variables in `ARROW_UNUSED`. These are warnings treated as errors by MSVC at compile time. This was revealed in CI after resolving the issue above.
### Are these changes tested?
I've enabled building and running the test suite in GHA in 8dd62d62a9af022813e9c8662956740340a9473f. I believe a large portion of those tests may be skipped though since Azurite isn't present from what I can see. I'm not tied to the GHA updates being included in the PR, it's currently here for demonstration purposes. I noticed the other FS implementations are also not built and tested on Windows.
One quirk of this PR is getting WIL in place to compile the Azure C++ SDK was not intuitive for me. I've placed a dummy `wilConfig.cmake` to get the Azure SDK to build, but I'd assume there's a better way to do this. I'm happy to refine the build setup if we choose to keep it.
### Are there any user-facing changes?
Nothing here should affect user-facing code beyond fixing the compilation issues. If there are concerns for things I'm missing, I'm happy to discuss those.
* GitHub Issue: #41990
Lead-authored-by: Nate Prewitt <nateprewitt@microsoft.com>
Co-authored-by: Nate Prewitt <nate.prewitt@gmail.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49138: [Packaging][Python] Remove nightly cython install from manylinux wheel dockerfile (#49139)
### Rationale for this change
We use nightly versions of Cython for free-threaded PyArrow wheels and they are currently failing, see https://github.com/apache/arrow/issues/49138
### What changes are included in this PR?
Nightly Cython install is removed and Cython is installed via [requirements file](https://github.com/apache/arrow/blob/main/python/requirements-wheel-build.txt#L2).
### Are these changes tested?
Yes.
### Are there any user-facing changes?
No.
* GitHub Issue: #49138
Authored-by: AlenkaF <frim.alenka@gmail.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>
* GH-33459: [C++][Python] Support step >= 1 in list_slice kernel (#48769)
### Rationale for this change
Closes ARROW-18281, which has been open since 2022. The `list_slice` kernel currently rejects `start == stop`, but should return empty lists instead (following Python slicing semantics).
The implementation already handles this case correctly. When ARROW-18282 added step support, `bit_util::CeilDiv(stop - start, step)` naturally returns 0 for `start == stop`, producing empty lists. The only issue was the validation check (`start >= stop`) that prevented this from working.
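The arithmetic can be checked with a small model. The sketch below is a simplified Python stand-in: `ceil_div` mirrors the rounding-up division described above for non-negative inputs, and `sliced_length` is a hypothetical helper that ignores clamping to the actual list length.

```python
def ceil_div(value: int, divisor: int) -> int:
    """Division rounding up, for value >= 0 and divisor > 0."""
    return (value + divisor - 1) // divisor


def sliced_length(start: int, stop: int, step: int = 1) -> int:
    """Number of elements selected by [start:stop:step], ignoring list bounds."""
    if start > stop:  # relaxed check: start == stop is now allowed
        raise ValueError("`start` should not be greater than `stop`")
    return ceil_div(stop - start, step)


empty = sliced_length(0, 0)
every_other = sliced_length(0, 3, 2)
```

With `start == stop` the numerator is zero, so the result is zero elements for any step, which matches Python's own `len(range(start, stop, step))`.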
### What changes are included in this PR?
- Changed validation from `start >= stop` to `start > stop`
- Updated error message
- Added test cases
### Are these changes tested?
Yes, tests were added.
### Are there any user-facing changes?
Yes.
```python
import pyarrow.compute as pc
pc.list_slice([[1,2,3]], 0, 0)
```
Before:
```
pyarrow.lib.ArrowInvalid: `start`(0) should be greater than 0 and smaller than `stop`(0)
```
After:
```
<pyarrow.lib.ListArray object at 0x1a01b8b20>
[
[]
]
```
* GitHub Issue: #33459
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>
* GH-41863: [Python][Parquet] Support lz4_raw as a compression name alias (#49135)
Closes https://github.com/apache/arrow/issues/41863
### Rationale for this change
Other tools in the parquet ecosystem distinguish between `LZ4` and `LZ4_RAW`, matching the specification: https://parquet.apache.org/docs/file-format/data-pages/compression/
`LZ4` (framing) is of course deprecated. PyArrow does not support it, and instead simplifies the user-facing API, using `LZ4` as an alias for the `LZ4_RAW` codec.
However, PyArrow does not accept `LZ4_RAW` as a valid alias for the `LZ4_RAW` codec:
```
ArrowException: Unsupported compression: lz4_raw
```
This is a friction issue, and confusing for some users who are aware of the differences.
### What changes are included in this PR?
- Adding `LZ4_RAW` to the acceptable codec names list.
- Modifying the `LZ4->LZ4_RAW` mapping to also accept `LZ4_RAW->LZ4_RAW`.
- Adding a test
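In outline, the two changes amount to an accepted-names check plus an alias map. The sketch below is hypothetical: the names `_ACCEPTED_CODECS`, `_CODEC_ALIASES`, and `resolve_codec` are invented for illustration, and the codec set is only a small subset, not PyArrow's actual list.

```python
# Hypothetical subset of accepted names; the real list lives in PyArrow.
_ACCEPTED_CODECS = {"NONE", "SNAPPY", "GZIP", "ZSTD", "LZ4", "LZ4_RAW"}

# Both spellings resolve to the same underlying codec.
_CODEC_ALIASES = {"LZ4": "LZ4_RAW", "LZ4_RAW": "LZ4_RAW"}


def resolve_codec(name: str) -> str:
    """Validate a user-supplied codec name and map aliases to the real codec."""
    upper = name.upper()
    if upper not in _ACCEPTED_CODECS:
        raise ValueError(f"Unsupported compression: {name}")
    return _CODEC_ALIASES.get(upper, upper)


from_alias = resolve_codec("lz4")
from_raw = resolve_codec("lz4_raw")
```

Because the change is additive, existing callers that pass `LZ4` see no difference in this model.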
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes, an additive change to the accepted codec names.
* GitHub Issue: #41863
Authored-by: Nick Woolmer <29717167+nwoolmer@users.noreply.github.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>
* GH-48868: [Doc] Document security model for the Arrow formats (#48870)
### Rationale for this change
Accessing Arrow data or any of the formats can have non-trivial security implications, this is an attempt at documenting those.
### What changes are included in this PR?
Add a Security Considerations page in the Format section.
**Doc preview:** https://s3.amazonaws.com/arrow-data/pr_docs/48870/format/Security.html
### Are these changes tested?
N/A
### Are there any user-facing changes?
No.
* GitHub Issue: #48868
Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
* GH-49004: [C++][FlightRPC] Run ODBC tests in workflow using `cpp_test.sh` (#49005)
### Rationale for this change
#49004
### What changes are included in this PR?
- Run tests using `cpp_test.sh` in the ODBC job of C++ Extra CI.
Note: `find_package(Arrow)` check in `cpp_test.sh` is disabled due to blocker GH-49050
### Are these changes tested?
Yes, in CI
### Are there any user-facing changes?
N/A
* GitHub Issue: #49004
Lead-authored-by: Alina (Xi) Li <alina.li@improving.com>
Co-authored-by: Alina (Xi) Li <96995091+alinaliBQ@users.noreply.github.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49092: [C++][FlightRPC][CI] Nightly Packaging: Add `dev-yyyy-mm-dd` to ODBC MSI name (#49151)
### Rationale for this change
#49092
### What changes are included in this PR?
- Add `dev-yyyy-mm-dd` to ODBC MSI name. This is a similar approach to R nightly.
Before: `Apache Arrow Flight SQL ODBC-1.0.0-win64.msi`. After: `Apache Arrow Flight SQL ODBC-1.0.0-dev-2026-02-04-win64.msi`.
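The renaming rule can be expressed as a small string transformation. This is an illustrative Python sketch only; the PR performs the rename in the CI workflow, and the helper name `nightly_msi_name` and the fixed `-win64.msi` suffix are assumptions.

```python
import datetime


def nightly_msi_name(name: str, date: datetime.date) -> str:
    """Insert `dev-yyyy-mm-dd` before the platform suffix of an MSI file name."""
    suffix = "-win64.msi"
    if not name.endswith(suffix):
        raise ValueError(f"unexpected MSI name: {name}")
    stem = name[: -len(suffix)]
    return f"{stem}-dev-{date.isoformat()}{suffix}"


renamed = nightly_msi_name(
    "Apache Arrow Flight SQL ODBC-1.0.0-win64.msi", datetime.date(2026, 2, 4)
)
```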
### Are these changes tested?
Tested in CI. Successfully renamed file: https://github.com/apache/arrow/actions/runs/21686252848/job/62534629714?pr=49151#step:3:26
### Are there any user-facing changes?
Yes, the nightly ODBC file names will be changed as described above.
* GitHub Issue: #49092
Authored-by: Alina (Xi) Li <alina.li@improving.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49156: [Python] Require GIL for string comparison (#49161)
### Rationale for this change
With Cython 3.3.0.a0 this failed. After some discussion, it seems that this code should always have required the GIL.
### What changes are included in this PR?
Moving statement out of the `with nogil` context manager.
### Are these changes tested?
Existing CI builds pyarrow.
### Are there any user-facing changes?
No
* GitHub Issue: #49156
Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>
* GH-48575: [C++][FlightRPC] Standalone ODBC macOS CI (#48577)
### Rationale for this change
#48575
### What changes are included in this PR?
- Add new ODBC workflow for macOS Intel 15 and 14 arm64.
- Added ODBC build fixes to enable build on macOS CI.
### Are these changes tested?
Tested in CI and local macOS Intel and M1 environments.
### Are there any user-facing changes?
N/A
* GitHub Issue: #48575
Lead-authored-by: Alina (Xi) Li <alina.li@improving.com>
Co-authored-by: justing-bq <62349012+justing-bq@users.noreply.github.com>
Co-authored-by: Victor Tsang <victor.tsang@improving.com>
Co-authored-by: Alina (Xi) Li <alinal@bitquilltech.com>
Co-authored-by: vic-tsang <victor.tsang@improving.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49164: [C++] Avoid invalid if() args in cmake when arrow is a subproject (#49165)
### Rationale for this change
Ref #49164: In subproject builds, `DefineOptions.cmake` sets `ARROW_DEFINE_OPTIONS_DEFAULT` to OFF, so `ARROW_SIMD_LEVEL` is never defined. The `if()` at `cpp/src/arrow/io/CMakeLists.txt:48` uses `${ARROW_SIMD_LEVEL}` and expands to empty, leading to invalid `if()` arguments.
### What changes are included in this PR?
Use the variable name directly (no `${}`).
### Are these changes tested?
Yes.
### Are there any user-facing changes?
None.
* GitHub Issue: #49164
Authored-by: Rossi Sun <zanmato1984@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-48132: [Ruby] Add support for writing dictionary array (#49175)
### Rationale for this change
Delta dictionary message support is out of scope.
### What changes are included in this PR?
* Add `ArrowFormat::DictionaryArray#each_buffer`
* Add `ArrowFormat::DictionaryType#build_fb_type`
* Add support for dictionary message in `ArrowFormat::StreamingWriter`
* Add support for writing dictionary message blocks in footer in `ArrowFormat::FileWriter`.
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* GitHub Issue: #48132
Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* GH-49081: [C++][Parquet] Correct variant's extension name (#49082)
### Rationale for this change
Correct variant extension according to arrow's specification.
### What changes are included in this PR?
Modified variant's hardcoded extension name.
### Are these changes tested?
Yes.
### Are there any user-facing changes?
No.
* GitHub Issue: #49081
Authored-by: Zehua Zou <zehuazou2000@gmail.com>
Signed-off-by: Gang Wu <ustcwg@gmail.com>
* GH-49102: [CI] Add type checking infrastructure and CI workflow for type annotations (#48618)
### Rationale for this change
This is the first in a series of PRs adding type annotations to pyarrow and resolving #32609.
### What changes are included in this PR?
This PR establishes infrastructure for type checking:
- Adds CI workflow for running mypy, pyright, and ty type checkers on Linux, macOS, and Windows
- Configures type checkers to validate stub files (excluding source files for now)
- Adds PEP 561 `py.typed` marker to enable type checking
- Updates wheel build scripts to include stub files in distributions
- Creates initial minimal stub directory structure
- Updates developer documentation with type checking workflow
### Are these changes tested?
No. This is mostly a CI change.
### Are there any user-facing changes?
This does not add any actual annotations (only the `py.typed` marker), so users should not be affected.
* GitHub Issue: #32609
* GitHub Issue: #49102
Lead-authored-by: Rok Mihevc <rok@mihevc.org>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Rok Mihevc <rok@mihevc.org>
* GH-49190: [C++][CI] Fix `unknown job 'odbc' error` in C++ Extra Workflow (#49192)
### Rationale for this change
See #49190
### What changes are included in this PR?
Fix `unknown job 'odbc' error` caused by a typo.
### Are these changes tested?
Tested in CI
### Are there any user-facing changes?
N/A
* GitHub Issue: #49190
Authored-by: Alina (Xi) Li <alinal@bitquilltech.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
* MINOR: [CI] Bump docker/login-action from 3.6.0 to 3.7.0 (#49191)
Bumps [docker/login-action](https://github.com/docker/login-action) from 3.6.0 to 3.7.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/docker/login-action/releases">docker/login-action's releases</a>.</em></p>
<blockquote>
<h2>v3.7.0</h2>
<ul>
<li>Add <code>scope</code> input to set scopes for the authentication token by <a href="https://github.com/crazy-max"><code>@crazy-max</code></a> in <a href="https://redirect.github.com/docker/login-action/pull/912">docker/login-action#912</a></li>
<li>Add support for AWS European Sovereign Cloud ECR by <a href="https://github.com/dphi"><code>@dphi</code></a> in <a href="https://redirect.github.com/docker/login-action/pull/914">docker/login-action#914</a></li>
<li>Ensure passwords are redacted with <code>registry-auth</code> input by <a href="https://github.com/crazy-max"><code>@crazy-max</code></a> in <a href="https://redirect…
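### What changes are included in this PR?
Bug fixes and robustness improvements in the IPC file reader:
* Fix bug reading variadic buffers with pre-buffering enabled
* Fix bug reading dictionaries with pre-buffering enabled
* Validate IPC buffer offsets and lengths
Testing improvements:
* Exercise pre-buffering in IPC tests
* Actually exercise variadic buffers in IPC tests, by ensuring non-inline binary views are generated
* Run fuzz targets on golden IPC integration files in ASAN/UBSAN CI job
* Exercise pre-buffering in the IPC file fuzz target
Miscellaneous:
* Add convenience functions for integer overflow checking
### Are these changes tested?
Yes, by existing and improved tests.
### Are there any user-facing changes?
Bug fixes.
**This PR contains a "Critical Fix".** Fixes a potential crash reading variadic buffers with pre-buffering enabled.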
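For readers unfamiliar with why this PR pairs validation of IPC buffer offsets and lengths with integer overflow checking, the following is a minimal sketch of the general idea. It is not the code added by this PR: `CheckedAdd` and `BufferRangeIsValid` are hypothetical names chosen for illustration, and the sketch assumes C++17 and signed 64-bit offsets and lengths.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical helper (not Arrow's API): returns a + b, or std::nullopt
// if the addition would overflow int64_t.
inline std::optional<int64_t> CheckedAdd(int64_t a, int64_t b) {
  constexpr int64_t kMax = std::numeric_limits<int64_t>::max();
  constexpr int64_t kMin = std::numeric_limits<int64_t>::min();
  if ((b > 0 && a > kMax - b) || (b < 0 && a < kMin - b)) {
    return std::nullopt;
  }
  return a + b;
}

// Hypothetical check (not Arrow's API): a buffer described by an untrusted
// (offset, length) pair is accepted only if it lies entirely inside a
// message body of body_size bytes.
inline bool BufferRangeIsValid(int64_t offset, int64_t length, int64_t body_size) {
  if (offset < 0 || length < 0 || body_size < 0) {
    return false;
  }
  // Computing offset + length naively could overflow on malicious input,
  // so the end position goes through the checked addition.
  const std::optional<int64_t> end = CheckedAdd(offset, length);
  return end.has_value() && *end <= body_size;
}
```

With a 64-byte body, `BufferRangeIsValid(48, 16, 64)` is accepted while `BufferRangeIsValid(56, 16, 64)` is rejected. An offset close to the `int64_t` maximum is rejected as well, instead of wrapping around into a range that looks valid.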